<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>spark Archives - Albert Nogués</title>
	<atom:link href="https://www.albertnogues.com/tag/spark/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.albertnogues.com/tag/spark/</link>
	<description>Data and Cloud Freelancer</description>
	<lastBuildDate>Fri, 31 May 2024 11:17:08 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://www.albertnogues.com/wp-content/uploads/2020/12/cropped-cropped-AlbertLogo2-32x32.png</url>
	<title>spark Archives - Albert Nogués</title>
	<link>https://www.albertnogues.com/tag/spark/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Data Quality Checks with Soda-Core in Databricks</title>
		<link>https://www.albertnogues.com/data-quality-checks-with-soda-core-in-databricks/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=data-quality-checks-with-soda-core-in-databricks</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Fri, 31 May 2024 11:17:06 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[soda]]></category>
		<category><![CDATA[spark]]></category>
		<category><![CDATA[sql]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=3439</guid>

					<description><![CDATA[<p>It&#8217;s easy to run data quality checks when working with Spark using the soda-core library. The library has support for Spark DataFrames. I&#8217;ve tested it within a Databricks environment and it worked quite easily for me. For the examples in this article I am loading the customers table from the TPC-H Delta tables in the &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/data-quality-checks-with-soda-core-in-databricks/">Data Quality Checks with Soda-Core in Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>It&#8217;s easy to run data quality checks when working with Spark using the soda-core library. The library has support for Spark DataFrames. I&#8217;ve tested it within a Databricks environment and it worked quite easily for me.</p>



<p>For the examples in this article I am loading the customers table from the TPC-H Delta tables in the databricks-datasets folder.</p>



<p>First of all we need to install the library, either scoped to our Databricks notebook or on our cluster. In my case I will install it notebook-scoped:</p>



<pre class="wp-block-code"><code>%pip install soda-core-spark-df</code></pre>



<p>Then we create a DataFrame from the TPC-H customers table:</p>



<pre class="wp-block-code"><code>#We create a table and read it into a dataframe
customer_df = spark.read.table("delta.`/databricks-datasets/tpch/delta-001/customer/`")</code></pre>



<p>We create a temporary view for our DataFrame so Soda can query the data and run the checks:</p>



<pre class="wp-block-code"><code>#We create a TempView
customer_df.createOrReplaceTempView("customer")</code></pre>



<p>And here comes the core of Soda: we define the checks using YAML (SodaCL) syntax:</p>



<pre class="wp-block-code"><code>from soda.scan import Scan
scan = Scan()
scan.set_scan_definition_name("Databricks Test Notebook")
scan.set_data_source_name("customer")
scan.add_spark_session(spark, data_source_name="customer")
#YAML Format
checks = '''
checks for customer:
  - row_count > 0
  - invalid_percent(c_phone) = 0:
      valid regex: ^&#91;0-9]{2}&#91;-]&#91;0-9]{3}&#91;-]&#91;0-9]{3}&#91;-]&#91;0-9]{4}$
  - duplicate_count(c_phone) = 0:
      name: No duplicate phone numbers
  - invalid_count(c_mktsegment) = 0:
      invalid values: &#91;HOUSEHOLD]
      name: HOUSEHOLD is not allowed as a Market Segment
'''
# you can use add_sodacl_yaml_file(s). Useful if the tests are in a github repo or FS
scan.add_sodacl_yaml_str(checks)
scan.execute()
print(scan.get_logs_text())</code></pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img fetchpriority="high" decoding="async" width="570" height="322" src="https://www.albertnogues.com/wp-content/uploads/2024/05/Output1.png" alt="" class="wp-image-3440" srcset="https://www.albertnogues.com/wp-content/uploads/2024/05/Output1.png 570w, https://www.albertnogues.com/wp-content/uploads/2024/05/Output1-300x169.png 300w, https://www.albertnogues.com/wp-content/uploads/2024/05/Output1-106x60.png 106w" sizes="(max-width: 570px) 100vw, 570px" /></figure>
</div>


<p>More info: <a href="https://docs.soda.io/soda/quick-start-databricks.html">Add Soda to a Databricks notebook | Soda Documentation</a></p>



<p>List of validations: <a href="https://docs.soda.io/soda-cl/validity-metrics.html">Validity metrics | Soda Documentation</a> and <a href="https://docs.soda.io/soda-cl/metrics-and-checks.html">SodaCL metrics and checks | Soda Documentation</a></p>
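


<p>If we want the notebook (or the job that runs it) to fail when any check fails, the <code>Scan</code> object also exposes helpers such as <code>has_check_fails()</code> and <code>assert_no_checks_fail()</code> (check the soda-core docs for the exact API of your version). A minimal sketch of how we could react to the outcome after <code>scan.execute()</code>:</p>



<pre class="wp-block-code"><code>#React to the scan outcome programmatically
if scan.has_check_fails():
    #Print the logs and stop the notebook/job
    print(scan.get_logs_text())
    raise Exception("Soda data quality checks failed for table customer")
#Alternatively, scan.assert_no_checks_fail() raises an AssertionError for us</code></pre>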



<p>We can enhance this a bit and generate a Spark DataFrame out of the list of our warning or error validation checks:</p>



<pre class="wp-block-code"><code>from datetime import datetime
schema_checks = 'datasource STRING, table STRING, rule_name STRING, rule STRING, column STRING, check_status STRING, number_of_errors_in_sample INT, check_time TIMESTAMP'
list_of_checks = &#91;]
for c in scan.get_scan_results()&#91;'checks']:
    list_of_checks = list_of_checks + &#91;&#91;scan.get_scan_results()&#91;'defaultDataSource'], c&#91;'table'], c&#91;'name'], c&#91;'definition'], c&#91;'column'], c&#91;'outcome'], 0 if 'pass'in c&#91;'outcome'] else int(c&#91;'diagnostics']&#91;'blocks']&#91;0]&#91;'totalFailingRows']), datetime.strptime(scan.get_scan_results()&#91;'dataTimestamp'], '%Y-%m-%dT%H:%M:%S%z')]]
list_of_checks_df = spark.createDataFrame(list_of_checks,schema_checks)
display(list_of_checks_df)</code></pre>



<figure class="wp-block-image size-large is-resized"><img decoding="async" width="1024" height="403" src="https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-1024x403.png" alt="" class="wp-image-3441" style="width:840px;height:auto" srcset="https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-1024x403.png 1024w, https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-300x118.png 300w, https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-768x302.png 768w, https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-152x60.png 152w, https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput.png 1328w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>In case we have the YAML file in our GitHub repo, we can read it and pass it in. Or, if we are working with Databricks Repos and the file is part of our repo, we can load it locally.</p>



<p>Accessing a remote file and reading it with requests:</p>



<pre class="wp-block-code"><code>#Trying to use a remote yaml file to enforce rules. We can upload it to a github of our own and use it in opur notebook.
#I've created a public repo so i dont need to authenticate to github, but in a real world scenario we should use private repo + secret scopes
customer_quality_rules = 'https://raw.githubusercontent.com/anogues/soda-core-quality-rules/main/soda-core-quality-rules-customer.yaml'
import requests
scan.add_sodacl_yaml_str(requests.get(customer_quality_rules).text)</code></pre>



<p>Or we can load it locally if we are using Databricks Repos:</p>



<pre class="wp-block-code"><code>scan.add_sodacl_yaml_file("your_file.yaml")</code></pre>
<p>The post <a href="https://www.albertnogues.com/data-quality-checks-with-soda-core-in-databricks/">Data Quality Checks with Soda-Core in Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Query Delta Tables in the DataLake from PowerBi with Databricks</title>
		<link>https://www.albertnogues.com/query-delta-tables-in-the-datalake-from-powerbi-with-databricks/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=query-delta-tables-in-the-datalake-from-powerbi-with-databricks</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Wed, 15 Nov 2023 18:47:00 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[powerbi]]></category>
		<category><![CDATA[spark]]></category>
		<category><![CDATA[sql]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=2195</guid>

					<description><![CDATA[<p>There are several ways to query Delta tables from Power BI. We are going to cover the 4th method here. To do it we first need a service principal, a Databricks secret scope backed by an Azure Key Vault, and the SPN password stored in that Key Vault. Once we have this, the first step is to &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/query-delta-tables-in-the-datalake-from-powerbi-with-databricks/">Query Delta Tables in the DataLake from PowerBi with Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>There are several ways to query Delta tables from Power BI.</p>



<ul class="wp-block-list">
<li>You can use Snowflake with external stages reading the Delta data from the data lake,</li>



<li>You can use the Parquet connector and query the data directly from the data lake (caution: this method does not support SPNs and only works with Delta tables that have a single version),</li>



<li>You can use the Delta Sharing connector (not possible in our case because the workspace does not have Unity Catalog),</li>



<li>Or, the recommended way, you can use Databricks itself (with an SPN + a Databricks cluster, either a Data Engineering cluster or a SQL Warehouse).</li>
</ul>



<p>We are going to cover the 4th method here. To do it we first need a service principal, a Databricks secret scope backed by an Azure Key Vault, and the SPN password stored in that Key Vault.</p>



<p>Once we have this, the first step is to set up the cluster with the credentials to access the data lake. For this we need to configure the Spark variables of our Databricks cluster. You can follow the guide <a href="https://learn.microsoft.com/en-us/azure/databricks/getting-started/connect-to-azure-storage">here</a>.</p>
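


<p>To give an idea of what those Spark variables look like, here is a sketch of the equivalent settings applied from a notebook with <code>spark.conf.set</code>. The storage account is the one used later in this post; the secret scope, key name and application/tenant IDs are placeholders you should adapt:</p>



<pre class="wp-block-code"><code>#Orientative example: OAuth access to ADLS Gen2 with a service principal.
#Storage account, secret scope/key and SPN details below are placeholders.
storage_account = "albertdatabricks001"
client_secret = dbutils.secrets.get(scope="my-keyvault-scope", key="spn-password")

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", "YourApplicationId")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/YourTenantId/oauth2/token")</code></pre>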



<p>After that your cluster should have the credentials in the Spark conf section, something like this:</p>



<figure class="wp-block-image size-large"><img decoding="async" width="677" height="1024" src="https://www.albertnogues.com/wp-content/uploads/2023/11/image-1-677x1024.png" alt="" class="wp-image-2197" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/image-1-677x1024.png 677w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-1-198x300.png 198w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-1-768x1161.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-1-40x60.png 40w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-1.png 979w" sizes="(max-width: 677px) 100vw, 677px" /></figure>



<p>The second step is creating an EXTERNAL table in Databricks to point to our Delta table(s). For this we connect to our Databricks workspace with the previous configuration, create a new notebook and define the external tables we want, something like this:</p>



<pre class="wp-block-code"><code>%sql
CREATE TABLE IF NOT EXISTS anogues.customers_external 
LOCATION 'abfss://raw@albertdatabricks001.dfs.core.windows.net/customers'</code></pre>



<p>Make sure the location is the right one, otherwise when we query the data we will get either an error or no results.</p>



<p>Once we have this we can query our external table and verify we can see the data:</p>



<pre class="wp-block-code"><code>%sql
select * from anogues.customers_external LIMIT 5;</code></pre>



<p>And provided we did it right we should see the data. There is no need to define the table columns: Delta uses Parquet under the hood, so it&#8217;s a self-contained format where the schema is stored alongside the data.</p>
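


<p>We can see this self-describing behaviour from a notebook as well: reading the same path as a DataFrame gives us the schema without declaring any columns (the path below is the one used in the CREATE TABLE above):</p>



<pre class="wp-block-code"><code>#Read the Delta folder directly; the schema comes from the Delta/Parquet metadata
customers_df = spark.read.format("delta").load("abfss://raw@albertdatabricks001.dfs.core.windows.net/customers")
customers_df.printSchema()</code></pre>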



<p></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="531" src="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556-1024x531.png" alt="" class="wp-image-2198" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556-1024x531.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556-300x156.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556-768x398.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556-1536x796.png 1536w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556-116x60.png 116w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556.png 1682w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Once we have confirmed this is working we can go to Power BI and try to import data using the Databricks connector:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="349" src="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142627-1024x349.png" alt="" class="wp-image-2199" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142627-1024x349.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142627-300x102.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142627-768x262.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142627-176x60.png 176w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142627.png 1214w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To configure the connector we need to get some details from our cluster. These can be found in the advanced options of our cluster, in the JDBC/ODBC tab:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="966" src="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142733-1024x966.png" alt="" class="wp-image-2200" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142733-1024x966.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142733-300x283.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142733-768x726.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142733-64x60.png 64w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142733.png 1228w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>On the following screen we need to select our authentication option to connect to the Databricks cluster. Since we have SAML + SCIM enabled in our workspaces, the username and password option is not possible. We either need a Databricks PAT token or Azure AD. I recommend the latter:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="507" src="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142917-1024x507.png" alt="" class="wp-image-2201" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142917-1024x507.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142917-300x149.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142917-768x381.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142917-121x60.png 121w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142917.png 1235w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>We click on it and select our AAD account. If all works well our session will be started. We should see it on the screen:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="506" src="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-143025-1024x506.png" alt="" class="wp-image-2202" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-143025-1024x506.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-143025-300x148.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-143025-768x379.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-143025-121x60.png 121w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-143025.png 1223w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then we click on connect, and we can see our data. Since we don&#8217;t have Unity Catalog, our table should appear in the hive_metastore catalog. There we can find our database and, inside it, our table(s). We click and either load all the tables we want or start transforming them inside Power BI, just as we would with any other data source.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="867" height="689" src="https://www.albertnogues.com/wp-content/uploads/2023/11/pbi.png" alt="" class="wp-image-2208" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/pbi.png 867w, https://www.albertnogues.com/wp-content/uploads/2023/11/pbi-300x238.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/11/pbi-768x610.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/pbi-76x60.png 76w" sizes="auto, (max-width: 867px) 100vw, 867px" /></figure>



<p>For more help, here is the documentation for the Power BI Databricks connector: <a href="https://learn.microsoft.com/en-us/azure/databricks/partners/bi/power-bi#--connect-power-bi-desktop-to-azure-databricks-manually">Connect Power BI to Azure Databricks &#8211; Azure Databricks | Microsoft Learn</a></p>
<p>The post <a href="https://www.albertnogues.com/query-delta-tables-in-the-datalake-from-powerbi-with-databricks/">Query Delta Tables in the DataLake from PowerBi with Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Databricks query federation with Snowflake. Easy and Fast!</title>
		<link>https://www.albertnogues.com/databricks-query-federation-with-snowflake-easy-and-fast/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=databricks-query-federation-with-snowflake-easy-and-fast</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Tue, 31 Jan 2023 12:15:05 +0000</pubDate>
				<category><![CDATA[BigData]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Snowflake]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[Cloud]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[snowflake]]></category>
		<category><![CDATA[spark]]></category>
		<category><![CDATA[sql]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=1757</guid>

					<description><![CDATA[<p>Introduction In the same way that it is possible to read and write Snowflake data from inside Databricks, it is also possible to use Databricks query federation against several SQL engines, including Snowflake. The currently supported engines are: We are going to demonstrate how it works with Snowflake. We will first create a table in Databricks, &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/databricks-query-federation-with-snowflake-easy-and-fast/">Databricks query federation with Snowflake. Easy and Fast!</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Introduction</h2>



<p>In the same way that it is possible to read and write Snowflake data from inside Databricks, it is also possible to use Databricks query federation against several SQL engines, including <a href="http://www.snowflake.com" target="_blank" rel="noopener" title="snowflake">Snowflake</a>. The currently supported engines are:</p>



<ul class="wp-block-list">
<li><a href="https://docs.databricks.com/query-federation/postgresql.html">PostgreSQL</a></li>



<li><a href="https://docs.databricks.com/query-federation/mysql.html">MySQL</a></li>



<li><a href="https://docs.databricks.com/query-federation/snowflake.html">Snowflake</a></li>



<li><a href="https://docs.databricks.com/query-federation/redshift.html">Redshift</a></li>



<li><a href="https://docs.databricks.com/query-federation/synapse.html">Synapse</a></li>



<li><a href="https://docs.databricks.com/query-federation/sql-server.html">SQL Server</a></li>
</ul>



<p>We are going to demonstrate how it works with Snowflake. We will first create a table in Databricks; it can be a Delta table stored in the data lake, an unmanaged table pointing to a set of files (external table), or anything in between.</p>



<h2 class="wp-block-heading">Creating the required resources</h2>



<p>I will go to Databricks and run the following:</p>



<pre class="wp-block-code"><code>CREATE OR REPLACE TABLE default.DATABRICKS_ALBERT(
NAME STRING,
SEX STRING);

INSERT INTO default.DATABRICKS_ALBERT(NAME, SEX) VALUES ('Albert','Male');</code></pre>



<p>This process can either be done from the Data Engineering persona or the SQL persona. I created it first from the data engineering part.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="558" src="https://www.albertnogues.com/wp-content/uploads/2023/01/image-3-1024x558.png" alt="" class="wp-image-1762" srcset="https://www.albertnogues.com/wp-content/uploads/2023/01/image-3-1024x558.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-3-300x164.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-3-768x419.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-3-110x60.png 110w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-3.png 1332w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Now we go to snowflake and create a second table which we want to join:</p>



<pre class="wp-block-code"><code>CREATE TABLE SANDBOX.DEFAULT.DATABRICKS_FEDERATED(
NAME STRING,
AGE INTEGER);

INSERT INTO SANDBOX.DEFAULT.DATABRICKS_FEDERATED (NAME, AGE) VALUES ('Albert', 37);</code></pre>



<p>And it gets created successfully:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="433" src="https://www.albertnogues.com/wp-content/uploads/2023/01/image-1-1024x433.png" alt="" class="wp-image-1759" srcset="https://www.albertnogues.com/wp-content/uploads/2023/01/image-1-1024x433.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-1-300x127.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-1-768x325.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-1-142x60.png 142w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-1.png 1189w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Now we can stay in the Data Engineering persona or go to Databricks SQL and create the virtual table that links to the table in Snowflake. This will do the mapping:</p>



<pre class="wp-block-code"><code>CREATE TABLE MY_TABLE
USING snowflake
OPTIONS(
dbtable 'YourTable',
sfURL 'yourURL.snowflakecomputing.com', --You can use privatelink if you have one
sfUser 'YourUser',
sfPassword 'YourPassword',
sfDatabase 'YourDB', --You can use a secret scope like: secret('scope_name', 'pwd_entry'),
sfSchema 'YourSchema',
sfWarehouse 'YourSnowflakeWarehouse'
);</code></pre>



<p>In Databricks SQL:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="482" src="https://www.albertnogues.com/wp-content/uploads/2023/01/image-4-1024x482.png" alt="" class="wp-image-1763" srcset="https://www.albertnogues.com/wp-content/uploads/2023/01/image-4-1024x482.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-4-300x141.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-4-768x361.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-4-1536x723.png 1536w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-4-127x60.png 127w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-4.png 1891w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>In the Data Engineering persona:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="646" src="https://www.albertnogues.com/wp-content/uploads/2023/01/image-5-1024x646.png" alt="" class="wp-image-1764" srcset="https://www.albertnogues.com/wp-content/uploads/2023/01/image-5-1024x646.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-5-300x189.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-5-768x484.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-5-1536x968.png 1536w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-5-95x60.png 95w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-5.png 1702w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<h2 class="wp-block-heading">Glueing it up together</h2>



<p>If you use Databricks SQL I couldn’t make it work with the STARTER ENDPOINT so be sure to use a normal WAREHOUSE to avoid any errors. In my case i created an XS warehouse, but now i can run a query fetching data from both tables:</p>
</blockquote>



<pre class="wp-block-code"><code>select a.name, a.sex, b.age
FROM default.DATABRICKS_ALBERT a , default.SNOWFLAKE_ALBERT b
where a.name=b.name</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="884" src="https://www.albertnogues.com/wp-content/uploads/2023/01/image-6-1024x884.png" alt="" class="wp-image-1765" srcset="https://www.albertnogues.com/wp-content/uploads/2023/01/image-6-1024x884.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-6-300x259.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-6-768x663.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-6-70x60.png 70w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-6.png 1081w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>And in the Data Engineering Persona:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="679" src="https://www.albertnogues.com/wp-content/uploads/2023/01/image-7-1024x679.png" alt="" class="wp-image-1766" srcset="https://www.albertnogues.com/wp-content/uploads/2023/01/image-7-1024x679.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-7-300x199.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-7-768x509.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-7-1536x1018.png 1536w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-7-91x60.png 91w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-7.png 1785w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>And of course you can use spark.sql with Python or any other language to query the table as well:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="383" src="https://www.albertnogues.com/wp-content/uploads/2023/01/image-8-1024x383.png" alt="" class="wp-image-1767" srcset="https://www.albertnogues.com/wp-content/uploads/2023/01/image-8-1024x383.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-8-300x112.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-8-768x288.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-8-1536x575.png 1536w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-8-160x60.png 160w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-8.png 1811w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Or directly with DataFrames, treating the federated table as any other table in the catalog:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="301" src="https://www.albertnogues.com/wp-content/uploads/2023/01/image-9-1024x301.png" alt="" class="wp-image-1768" srcset="https://www.albertnogues.com/wp-content/uploads/2023/01/image-9-1024x301.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-9-300x88.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-9-768x226.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-9-1536x452.png 1536w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-9-204x60.png 204w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-9.png 1856w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Hope this clears your way and helps you integrate data from different sources without having to use a virtual metadata layer.</p>
<p>The post <a href="https://www.albertnogues.com/databricks-query-federation-with-snowflake-easy-and-fast/">Databricks query federation with Snowflake. Easy and Fast!</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Smallest Analytical Platform Ever!</title>
		<link>https://www.albertnogues.com/smallest-analytical-platform-ever/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=smallest-analytical-platform-ever</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Sat, 07 May 2022 08:38:12 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Cloud]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Spark]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[git]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[spark]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=1440</guid>

					<description><![CDATA[<p>I&#8217;ve started working, in some of my free time, on a project to build the smallest useful analytics platform on the cloud (starting with Azure). The purpose is to use it as a PoC to show to colleagues, managers, prospective customers or just to have fun and play. It&#8217;s publicly available on my GitHub repo &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/smallest-analytical-platform-ever/">Smallest Analytical Platform Ever!</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>I&#8217;ve started working, in some of my free time, on a project to build the smallest useful analytics platform on the cloud (starting with Azure).</p>



<p>The purpose is to use it as a PoC to show to colleagues, managers or prospective customers, or just to have fun and play.</p>



<p>It&#8217;s publicly available on my GitHub repo and any collaboration is welcome. You can fork it, improve it, send PRs and do whatever you want!</p>



<p>The first version will run solely on Azure. The objective is to show the following technologies/disciplines:</p>



<p>* Infrastructure as Code (IaC), using Terraform</p>



<p>* Cloud architecture and Cloud Ops, using an Azure cloud environment</p>



<p>* Data Engineering, using a Spark-powered Databricks notebook and an ADF pipeline (future)</p>



<p>* DevOps to trigger some pipelines based on changes (future)</p>



<p>* Basic security concepts (Key Vault, service principals, least-privilege RBAC access&#8230;)</p>



<p>* FinOps, keeping the costs to a minimum and choosing the proper tools for the job</p>



<p>* Reporting and Dashboarding on data in the platform</p>



<p>* Data management: we will use an ADLS storage account and an Azure SQL DB</p>



<p>TOOLS:</p>



<p>* Terraform to deploy all the infra as code</p>



<p>* Azure Cloud to host our resources</p>



<p>You have the code plus all the information on my github repo:</p>



<p><a href="https://github.com/anogues/ProjectZ">https://github.com/anogues/ProjectZ</a></p>



<p></p>
<p>The post <a href="https://www.albertnogues.com/smallest-analytical-platform-ever/">Smallest Analytical Platform Ever!</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Implementing CI/CD in Databricks with Azure DevOps (Part 1)</title>
		<link>https://www.albertnogues.com/implementing-ci-cd-in-databricks-with-azure-devops-part-1/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=implementing-ci-cd-in-databricks-with-azure-devops-part-1</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Sat, 30 Apr 2022 15:10:29 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Spark]]></category>
		<category><![CDATA[Cloud]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[git]]></category>
		<category><![CDATA[spark]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=1416</guid>

					<description><![CDATA[<p>There are many ways to implement CI/CD with Databricks. We can use Azure DevOps, GitHub + GitHub Actions or any other combination of tools, including the dbx tool. But a simple way to just copy notebooks between workspaces can be implemented easily with Azure DevOps. We are going to use the Git repos capability of Azure &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/implementing-ci-cd-in-databricks-with-azure-devops-part-1/">Implementing CI/CD in Databricks with Azure DevOps (Part 1)</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>There are many ways to implement CI/CD with Databricks. We can use Azure DevOps, GitHub + GitHub Actions or any other combination of tools, including the <a href="https://dbx.readthedocs.io/en/latest/templates/python_basic.html#project-file-structure" target="_blank" rel="noreferrer noopener">dbx tool</a>.</p>



<p>But a simple way to just copy notebooks between workspaces can be implemented easily with Azure DevOps.</p>



<p>We are going to use the Git repos capability of Azure Databricks, so when a new code change is committed in a notebook, an Azure DevOps pipeline will copy the notebook from the first Databricks workspace (in our case a NonProd workspace) to the target one, in our case the Prod workspace.</p>



<p>To achieve this we will use some more components from the Azure ecosystem, including Key Vaults to keep all our secrets stored safely. The list of prerequisites is the following:</p>



<ul class="wp-block-list"><li>Two Databricks workspaces, one our source workspace (NonProd) and another, our Production one.</li><li>An Azure Keyvault (or two if we want to segregate the environments)</li><li>Azure Databricks repository configured at least in our source workspace, so when the change is commited we can triger the pipeline that will fetch the notebook and transport it to the prod workspace</li><li>Access to Azure DevOps (Something similar can be implemented with Github + Github Actions)</li></ul>



<p>Let&#8217;s see how to implement it. First we need to make sure Git repos is enabled in our source workspace. We can verify it by logging in to our workspace with an admin-privileged user and making sure the option is checked as follows:</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD1-924x1024.png" alt="" class="wp-image-1418" width="687" height="760" srcset="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD1-924x1024.png 924w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD1-271x300.png 271w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD1-768x851.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD1-54x60.png 54w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD1.png 1072w" sizes="auto, (max-width: 687px) 100vw, 687px" /><figcaption>Fig 1. Make sure that github repos is enabled in our workspace.</figcaption></figure>



<p>Secondly, we go to <a href="https://azure.microsoft.com/en-us/services/devops/" target="_blank" rel="noreferrer noopener">Azure DevOps services</a> and create a new project. I&#8217;ve called it DatabricksCICD, but feel free to call it whatever you need:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="714" src="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD2-1024x714.png" alt="" class="wp-image-1419" srcset="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD2-1024x714.png 1024w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD2-300x209.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD2-768x536.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD2-86x60.png 86w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD2.png 1531w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>Fig 2. Create a new repo and initialize it</figcaption></figure>



<p>Once created, we take the details for cloning our repo and copy them. We go to our Databricks workspace, look for the Repos option on the left, and add a new repository. We need to paste the URL to clone our newly created Azure DevOps repository:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="319" src="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-1024x319.png" alt="" class="wp-image-1420" srcset="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-1024x319.png 1024w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-300x93.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-768x239.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-1536x478.png 1536w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-2048x637.png 2048w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-193x60.png 193w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>Fig 3. Cloning our Azure DevOps Repository</figcaption></figure>



<p>Once we have linked our Databricks workspace with our DevOps repo, we can create a new notebook. In the same Repos section, click on the down arrow to create a new notebook, as shown below:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="889" height="373" src="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD4.png" alt="" class="wp-image-1421" srcset="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD4.png 889w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD4-300x126.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD4-768x322.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD4-143x60.png 143w" sizes="auto, (max-width: 889px) 100vw, 889px" /><figcaption>Fig 4.  Creating a new notebook.</figcaption></figure>



<p>As for the content of the notebook, you can put anything you want. I&#8217;m writing a print(&#8220;Hello from Albert&#8221;) statement. We will not run it; we just want to show it&#8217;s possible to transport it. Once done, click on Save Now in the revision tab:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="73" src="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-1024x73.png" alt="" class="wp-image-1423" srcset="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-1024x73.png 1024w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-300x21.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-768x55.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-1536x110.png 1536w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-2048x146.png 2048w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-600x43.png 600w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>Fig 5. Saving our changes to the notebook.</figcaption></figure>



<p>Then click on the main branch button on the left; from there we will be committing the changes to our repository:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="314" src="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-1024x314.png" alt="" class="wp-image-1424" srcset="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-1024x314.png 1024w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-300x92.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-768x235.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-1536x470.png 1536w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-2048x627.png 2048w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-196x60.png 196w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>Fig 6. Pushing our notebook to the DevOps repo</figcaption></figure>



<p>If we go back now to our Azure DevOps project we should see the file has been committed to the repository. This ends the first part of this tutorial.</p>



<p>In the second blog entry we will see how to trigger the pipeline after a modification of this notebook, and how to pass the credentials of the second workspace in order to deliver the changed notebook to our Production (target) workspace.</p>
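


<p>As a teaser of what that pipeline step will do, here is a rough sketch of copying a notebook between workspaces with the Databricks workspace REST API from Python. The workspace URLs, tokens and notebook path are placeholders; in the real pipeline the tokens would come from the Key Vault:</p>



<pre class="wp-block-code"><code>#Rough sketch: export a notebook from the NonProd workspace and import it into Prod.
#All hosts, tokens and paths below are placeholders.
import requests

source_host = "https://adb-nonprod.azuredatabricks.net"
source_token = "dapiNonProdToken"   #in the real pipeline these come from the Key Vault
target_host = "https://adb-prod.azuredatabricks.net"
target_token = "dapiProdToken"
notebook_path = "/Repos/DatabricksCICD/my_notebook"

#Export the notebook in SOURCE format (the API returns base64-encoded content)
export = requests.get(
    source_host + "/api/2.0/workspace/export",
    headers={"Authorization": "Bearer " + source_token},
    params={"path": notebook_path, "format": "SOURCE"},
).json()

#Import it into the target workspace, overwriting any previous version
requests.post(
    target_host + "/api/2.0/workspace/import",
    headers={"Authorization": "Bearer " + target_token},
    json={"path": notebook_path, "format": "SOURCE", "language": "PYTHON",
          "content": export&#91;"content"], "overwrite": True},
).raise_for_status()</code></pre>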
<p>The post <a href="https://www.albertnogues.com/implementing-ci-cd-in-databricks-with-azure-devops-part-1/">Implementing CI/CD in Databricks with Azure DevOps (Part 1)</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Databricks cluster policies at a glance. The easy way!</title>
		<link>https://www.albertnogues.com/databricks-cluster-policies-at-a-glance-the-easy-way/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=databricks-cluster-policies-at-a-glance-the-easy-way</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Tue, 08 Feb 2022 17:30:00 +0000</pubDate>
				<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Cloud]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[spark]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=1318</guid>

					<description><![CDATA[<p>For those administering one or more Databricks workspaces, cluster policies are an important tool that we spend quite some time with. Introduction But what are cluster policies? Cluster policies are basically a JSON document with some parameters that we use to allow (or not allow) users to select certain things when creating a cluster. Not only users &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/databricks-cluster-policies-at-a-glance-the-easy-way/">Databricks cluster policies at a glance. The easy way!</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>For those administering one or more Databricks workspaces, cluster policies are an important tool that we spend quite some time with.</p>



<h2 class="wp-block-heading" id="introduction">Introduction</h2>



<p>But what are cluster policies?</p>



<p>Cluster policies are basically a JSON document with some parameters that we use to allow (or not allow) users to select certain things when creating a cluster. Not only can users select (or deselect) options, we can also force some cluster parameters by default.</p>



<p>The purpose of using cluster policies is not only standardizing or forcing certain specific configurations, but also limiting human error that can cost the company lots of money, by capping certain parameters to only allow specific machine sizes, a maximum number of nodes, or a cluster timeout.</p>



<p>To be able to use cluster policies, you need to have a Premium workspace. And of course, you need to be an admin of the workspace to be able to define them.</p>



<h3 class="wp-block-heading" id="format-of-a-cluster-policy-and-it-s-elements">Format of a cluster policy and it&#8217;s elements</h3>



<p>The format of a policy, as we said, is a JSON document:</p>



<pre class="wp-block-code"><code>interface Policy {
  &#91;path: string]: PolicyElement
}</code></pre>



<p>The list of policy element types we can use is quite long, and they are explained in the official Databricks documentation:</p>



<ul class="wp-block-list"><li><a href="https://docs.databricks.com/administration-guide/clusters/policies.html#fixed-policy">Fixed policy</a></li><li><a href="https://docs.databricks.com/administration-guide/clusters/policies.html#forbidden-policy">Forbidden policy</a></li><li><a href="https://docs.databricks.com/administration-guide/clusters/policies.html#limiting-policies-common-fields">Limiting policies: common fields</a></li><li><a href="https://docs.databricks.com/administration-guide/clusters/policies.html#allow-list-policy">Allow list policy</a></li><li><a href="https://docs.databricks.com/administration-guide/clusters/policies.html#block-list-policy">Block list policy</a></li><li><a href="https://docs.databricks.com/administration-guide/clusters/policies.html#regex-policy">Regex policy</a></li><li><a href="https://docs.databricks.com/administration-guide/clusters/policies.html#range-policy">Range policy</a></li><li><a href="https://docs.databricks.com/administration-guide/clusters/policies.html#unlimited-policy">Unlimited policy</a></li></ul>



<p>In this article we will create a simple policy that preconfigures a cluster with a specific machine size, restricts the maximum number of nodes to 5, and autotags the cluster with a specific key that we define in the policy.</p>



<p>However, there are endless possibilities, so I recommend having a look at the official documentation to get an idea of all the parameters you can configure <a href="https://docs.databricks.com/administration-guide/clusters/policies.html" target="_blank" rel="noreferrer noopener">here</a>.</p>



<h2 class="wp-block-heading" id="creating-a-custom-cluster-policy">Creating a custom cluster policy</h2>



<p>OK, let&#8217;s start! To create our first policy we need to log in to our workspace, go to the Compute section and click on the Cluster Policies tab:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="980" height="586" src="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies1.png" alt="" class="wp-image-1321" srcset="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies1.png 980w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies1-300x179.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies1-768x459.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies1-100x60.png 100w" sizes="auto, (max-width: 980px) 100vw, 980px" /><figcaption>Fig 1. Creating a Cluster Policy on Azure Databricks</figcaption></figure>



<p>Then, if we have the rights (i.e. we are administrators of the workspace), we should see a button called Create Cluster Policy. Once clicked, we will see something similar to Fig 2:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="673" height="505" src="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies2.png" alt="" class="wp-image-1322" srcset="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies2.png 673w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies2-300x225.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies2-80x60.png 80w" sizes="auto, (max-width: 673px) 100vw, 673px" /><figcaption>Fig 2. A Cluster Policy</figcaption></figure>



<p>Here we have to concentrate on three things. The first one is the policy name. This is the name your users will see, so I recommend choosing a meaningful name, for example &#8220;multinode small cluster&#8221; or something like that.</p>



<p>Once the name is selected, we need to actually define the policy. This is a JSON content box. In my sample case I want to create a policy that by default chooses a machine with a small/medium-sized SKU and can autoscale to a maximum of 5 nodes. I also want to tag the cluster with a fixed string. To go further in this example, I will give the user a choice of two machine types for the worker nodes, while the driver will be restricted to a specific VM SKU. I will also set auto termination to 10 minutes to make sure I am not paying for something not in use. My policy will look like the following:</p>



<pre class="wp-block-code"><code>{
  "node_type_id": {
    "type": "allowlist",
    "values": &#91;
      "Standard_D8d_v4",
      "Standard_D16d_v4"
    ],
    "defaultValue": "Standard_D8d_v4"
  },
  "driver_node_type_id": {
    "type": "fixed",
    "value": "Standard_D8d_v4",
    "hidden": true
  },
  "autoscale.min_workers": {
    "type": "fixed",
    "value": 1,
    "hidden": true
  },
  "autoscale.max_workers": {
    "type": "range",
    "maxValue": 5,
    "defaultValue": 2
  },
  "autotermination_minutes": {
    "type": "fixed",
    "value": 10,
    "hidden": true
  },
  "custom_tags.project": {
    "type": "fixed",
    "value": "Albert"
  }
}</code></pre>



<p>Once defined, I click on Save. If the format of the policy is OK, the cluster policy will be created. I can then go to the Permissions tab and assign it to users or groups. By default the policy can only be used by the admin group:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="833" height="487" src="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies3.png" alt="" class="wp-image-1323" srcset="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies3.png 833w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies3-300x175.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies3-768x449.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies3-103x60.png 103w" sizes="auto, (max-width: 833px) 100vw, 833px" /><figcaption>Fig 3. Permissions for a policy</figcaption></figure>



<p>We will leave it as it is since we are just playing around.</p>
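


<p>As a side note, the same policy can also be created programmatically, which is handy if you manage several workspaces. A minimal sketch against the Cluster Policies REST API (the workspace URL and token are placeholders, and the definition is an abridged version of the JSON shown above):</p>



<pre class="wp-block-code"><code>#Sketch: create a cluster policy through the Cluster Policies REST API.
#Workspace URL and token are placeholders.
import json
import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"
token = "dapiYourPatToken"

policy_definition = {
    "node_type_id": {"type": "allowlist",
                     "values": &#91;"Standard_D8d_v4", "Standard_D16d_v4"],
                     "defaultValue": "Standard_D8d_v4"},
    "autoscale.max_workers": {"type": "range", "maxValue": 5, "defaultValue": 2},
    "autotermination_minutes": {"type": "fixed", "value": 10, "hidden": True},
    "custom_tags.project": {"type": "fixed", "value": "Albert"},
}

response = requests.post(
    host + "/api/2.0/policies/clusters/create",
    headers={"Authorization": "Bearer " + token},
    #The API expects the definition as a JSON string
    json={"name": "Multinode small cluster", "definition": json.dumps(policy_definition)},
)
response.raise_for_status()
print(response.json())  #returns the new policy_id</code></pre>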



<p>We can now try to create a cluster using this policy:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1003" height="937" src="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies4.png" alt="" class="wp-image-1325" srcset="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies4.png 1003w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies4-300x280.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies4-768x717.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies4-64x60.png 64w" sizes="auto, (max-width: 1003px) 100vw, 1003px" /><figcaption>Fig 4. Create a cluster with the new policy</figcaption></figure>



<p>Select the newly created policy. For the worker type, the policy chooses the machine SKU Standard_D8d_v4 by default but allows the user to choose a bigger one, and for autoscaling we can choose up to 5 nodes. If we try to input 6 we will get an error, &#8220;Max Workers cannot be more than 5&#8221;, and it will not let us go through with the cluster creation.</p>



<p>The driver type and the default cluster timeout are not shown, as we don&#8217;t allow the user to change them; they are set by default and we chose to make them hidden. By removing the hidden attribute we let the users see the values but not modify them:</p>



<pre class="wp-block-code"><code>"driver_node_type_id": {
    "type": "fixed",
    "value": "Standard_D8d_v4",
    <strong>"hidden": true</strong>
  },
...
  "autotermination_minutes": {
    "type": "fixed",
    "value": 10,
    <strong>"hidden": true</strong>
  },</code></pre>



<p>And the last thing to check is in the advanced options. We wanted to autotag the cluster with the project tag set to a fixed string. If we expand the advanced options we can verify how this tag is applied:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="790" height="606" src="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies5.png" alt="" class="wp-image-1326" srcset="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies5.png 790w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies5-300x230.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies5-768x589.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies5-78x60.png 78w" sizes="auto, (max-width: 790px) 100vw, 790px" /><figcaption>Fig 5. Project tag in advanced options</figcaption></figure>



<p>And if we go to our cloud provider and check the machines created as part of the cluster, we will see how the tag has also been propagated to the cloud resources. This will allow us to do cost analysis from the provider&#8217;s cloud portal.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="305" src="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies6-1024x305.png" alt="" class="wp-image-1328" srcset="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies6-1024x305.png 1024w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies6-300x89.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies6-768x229.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies6-1536x458.png 1536w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies6-2048x610.png 2048w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies6-201x60.png 201w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>Fig 6. Project Tag propagated to the underlying cloud resource (VM).</figcaption></figure>



<h2 class="wp-block-heading" id="next-steps">Next Steps</h2>



<p>There is also another very good article about defining the strategy when implementing policies. As this process can lead to errors, it&#8217;s important not to enforce the policies directly in a production workspace before testing. Databricks has created a reference methodology that I think makes perfect sense when implementing these policies: start by testing the policy, then create a barebones policy to do some tagging, then deploy the real policy without enforcing it on any user (so it can only be selected optionally), and once you have tested that everything is working as expected, enforce it on the users.</p>



<p>By using this framework you can correct anything that may be wrong before it becomes an issue for the final users.</p>



<p>There is also an interesting section in that article about something that usually worries all of us who work with a large user base: tag enforcement on resources, which is very important when cross-charging between teams or different parts of the organization is required. In order to attribute costs to the proper project, department or other body, you need to enforce tagging of your resources. This is a very important topic when working in the cloud, and while it&#8217;s a bit difficult to implement a proper tagging policy in Databricks, since you mainly have to rely on free-form text or a regex expression, you can still build an effective system to cross-charge your projects or departments for the use of your Databricks workspace(s).</p>
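


<p>As a minimal sketch of what such a tag constraint could look like (the <code>project_code</code> tag name and its pattern are hypothetical, not taken from the article), we can build the policy fragment in Python and paste the resulting JSON into the policy definition:</p>



<pre class="wp-block-code"><code># Hypothetical tag-enforcement fragment for a cluster policy definition.
# The regex constraint forces users to supply a project code such as PRJ-1234.
import json

policy_fragment = {
    "custom_tags.project_code": {
        "type": "regex",
        "pattern": "PRJ-[0-9]{4}"
    }
}

# Print the JSON so it can be merged into the policy definition in the UI.
print(json.dumps(policy_fragment, indent=2))</code></pre>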



<p>This set of best practices and challenges is documented <a href="https://docs.databricks.com/administration-guide/clusters/policies-best-practices.html" target="_blank" rel="noreferrer noopener">here</a>, and for me it&#8217;s an essential resource for any Databricks administrator.</p>



<p>Have fun!</p>
<p>The post <a href="https://www.albertnogues.com/databricks-cluster-policies-at-a-glance-the-easy-way/">Databricks cluster policies at a glance. The easy way!</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Using Azure Private Endpoints with Databricks</title>
		<link>https://www.albertnogues.com/using-azure-private-endpoints-with-databricks/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=using-azure-private-endpoints-with-databricks</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Thu, 09 Dec 2021 19:31:32 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Spark]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[Cloud]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[PrivateEndpoints]]></category>
		<category><![CDATA[spark]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=1235</guid>

					<description><![CDATA[<p>In this article I will show how to avoid going out to the internet when using resources inside Azure, especially if they are in the same subscription and location (datacenter). Why might we want a private endpoint? That&#8217;s a good question: for both security and performance. Just like using TSCM Equipment for optimal safety and &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/using-azure-private-endpoints-with-databricks/">Using Azure Private Endpoints with Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>In this article I will show how to avoid going out to the internet when using resources inside Azure, especially if they are in the same subscription and location (datacenter).</p>



<p>Why might we want a private endpoint? That&#8217;s a good question: for both security and performance. Just like using <a href="https://spyassociates.com/counter-surveillance">TSCM Equipment</a> for optimal safety and security. We don&#8217;t want traffic to go out to the internet only to come back into the Azure datacenter when the resource we are trying to reach is already there. With a Private Link the traffic stays inside the Azure backbone network and never reaches the internet. More information about private endpoints <a href="https://azure.microsoft.com/en-us/services/private-link/" target="_blank" rel="noreferrer noopener">here</a> and <a href="https://docs.microsoft.com/en-us/azure/private-link/private-link-overview" target="_blank" rel="noreferrer noopener">here</a>.</p>



<p>Though it&#8217;s possible to create private endpoints to connect to services in other subscriptions, in this article we will stay within the same subscription and the West Europe region. The goal is to connect both to an AzureSQL database and to a data lake using private connectivity.</p>



<h2 class="wp-block-heading">Creating a Private Endpoint for AzureSQL and integrating in the databricks vnet</h2>



<p>For this, I created a Databricks workspace and selected an already existing VNET, so that I can add a new subnet for my private endpoints. One of the good things about doing it this way is that NICs in different subnets can see and reach each other (unless we block it with a network security group); by default traffic is open within the VNET. So I can create a private endpoint in a specific subnet of the same VNET that hosts the Databricks subnets.</p>



<p>Bear in mind that it&#8217;s not possible to add a private endpoint to a subnet managed by Databricks. So the two subnets created when we deployed the Databricks workspace (both public and private) should not be modified. We will create a new one as shown in the screen below:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="145" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-1024x145.png" alt="" class="wp-image-1236" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-1024x145.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-300x43.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-768x109.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-1536x218.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-2048x290.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-424x60.png 424w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Our  Databricks VNET. Among the two subnets created when the databricks workspace is created i added a new one to host our Private Endpoints</figcaption></figure>



<p>Once the VNET is properly defined, we are going to create the private endpoint to reach our AzureSQL server through it.</p>



<p>First we need to go to the Azure Portal, find our AzureSQL Server, click on the left menu entry called Private Endpoint Connections and click on the plus sign on top to create a new one. We just need to select the subscription, the resource group, the name of the private endpoint and the region. We can fill it in as shown in the following picture:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="946" height="626" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep1.png" alt="" class="wp-image-1237" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep1.png 946w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep1-300x199.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep1-768x508.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep1-91x60.png 91w" sizes="auto, (max-width: 946px) 100vw, 946px" /><figcaption class="wp-element-caption">Private Endpoint Creation. Step 1</figcaption></figure>



<p>The second step requires a bit more information; here we define which resource our private endpoint targets. As expected, we need to find our AzureSQL Server here. We fill in the combo boxes as usual:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="920" height="529" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep2.png" alt="" class="wp-image-1238" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep2.png 920w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep2-300x173.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep2-768x442.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep2-330x190.png 330w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep2-104x60.png 104w" sizes="auto, (max-width: 920px) 100vw, 920px" /><figcaption class="wp-element-caption">Private Endpoint Creation. Step 2</figcaption></figure>



<p>The third screen is the most important one. We need to select the VNET and subnet that will host our private endpoint. In this case we want to use Databricks, so we need to use the VNET we created for Databricks, and then the subnet we created specifically to host the private endpoints.</p>



<p>Another important step here is to integrate it with DNS. If we don&#8217;t integrate it, then when we use the AzureSQL hostname provided by Azure we will still go through the public endpoint. By integrating it with DNS, queries for the public hostname in this private zone will resolve to the private IP of the NIC of the private endpoint.</p>



<p>If we choose not to integrate with DNS, we will have to add static entries in /etc/hosts or similar, or use the private IP instead of the hostname when connecting to the AzureSQL server. To simplify, we choose to integrate it.</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="583" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep3-1024x583.png" alt="" class="wp-image-1239" style="width:840px;height:478px" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep3-1024x583.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep3-300x171.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep3-768x438.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep3-105x60.png 105w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep3.png 1241w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"> Private Endpoint Creation. Step 3.</figcaption></figure>



<p>Once created, we should see the private endpoint available. If you look at the right, it is implemented through a NIC (Network Interface Card), and by clicking on it we can find it and see the IP address assigned:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="177" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-1024x177.png" alt="" class="wp-image-1241" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-1024x177.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-300x52.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-768x132.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-1536x265.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-2048x353.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-348x60.png 348w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Our newly created Private Endpoint for Azure SQL</figcaption></figure>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="225" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-1024x225.png" alt="" class="wp-image-1242" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-1024x225.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-300x66.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-768x168.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-1536x337.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-2048x449.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-274x60.png 274w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Finding the PrivateIP Address of the NIC that implements the private endpoint</figcaption></figure>



<h2 class="wp-block-heading">Test the AzureSQL DB Endpoint from Databricks</h2>



<p>Now we have it ready. From outside that VNET our server still resolves to a public IP, as was also the case inside Databricks before the private endpoint existed. We can ping it for testing purposes:</p>



<pre class="wp-block-code"><code>C:\Users\Albert&gt;ping azure-sql-server-albert.database.windows.net

Haciendo ping a cr4.westeurope1-a.control.database.windows.net &#91;<strong>104.40.168.105</strong>] con 32 bytes de datos:</code></pre>



<p>As you can see we get a public IP, but let&#8217;s try to ping it from inside the cluster:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="885" height="178" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep7.png" alt="" class="wp-image-1244" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep7.png 885w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep7-300x60.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep7-768x154.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep7-298x60.png 298w" sizes="auto, (max-width: 885px) 100vw, 885px" /><figcaption class="wp-element-caption">Private endpoint with the DNS integration working fine. Our dns record for the AzureSQL Db does not resolve to a public ip anymore but to the private IP of the PrivateEndpoint</figcaption></figure>



<p>So it&#8217;s working: it&#8217;s using the private IP instead of the public one. Our last step is to check that we can fetch data from the database:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="363" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-1024x363.png" alt="" class="wp-image-1245" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-1024x363.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-300x106.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-768x272.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-1536x544.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-2048x726.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-169x60.png 169w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Accessing AzureSQL Database though a private endpoint from databricks</figcaption></figure>



<h2 class="wp-block-heading">Creating an AzureDataLake PrivateEndpoint and saving our data to the DataLake through it.</h2>



<p>We are not done yet! We can complicate matters further and create a private endpoint to save data to our data lake as well.</p>



<p>I&#8217;ve created an ADLS Gen 2 storage account, and going back to databricks I see by default it&#8217;s using public access:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="836" height="194" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep9.png" alt="" class="wp-image-1246" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep9.png 836w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep9-300x70.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep9-768x178.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep9-259x60.png 259w" sizes="auto, (max-width: 836px) 100vw, 836px" /><figcaption class="wp-element-caption">Datalake public access</figcaption></figure>



<p>But we can implement a private endpoint as well and route all the traffic through the Azure datacenter itself. Let&#8217;s see how to do it. To achieve this, we go to our ADLS Gen2 storage account, on the left we click again on Networking, and the second tab is called Private Endpoint Connections. We click the plus button to create a new one and basically follow the same steps as before, with a subtle difference.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="930" height="760" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep10.png" alt="" class="wp-image-1247" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep10.png 930w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep10-300x245.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep10-768x628.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep10-73x60.png 73w" sizes="auto, (max-width: 930px) 100vw, 930px" /><figcaption class="wp-element-caption">Creation of a private endpoint for an ADLS Gen2 storage account.</figcaption></figure>



<p>The difference with a storage account is that we need to choose which API we want to create the private endpoint for. We can use the blob, the table, the queue, the file share or the dfs (Data Lake) endpoint (and also the static website!).</p>



<p>We will use the dfs endpoint, and again we will place it in the private endpoint subnet of our Databricks VNET. Something like this:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="814" height="990" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep11.png" alt="" class="wp-image-1248" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep11.png 814w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep11-247x300.png 247w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep11-768x934.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep11-49x60.png 49w" sizes="auto, (max-width: 814px) 100vw, 814px" /><figcaption class="wp-element-caption">Creating a Private Endpoint for our DataLake an dplacing it in the appropiate subnet</figcaption></figure>



<p>After a few minutes our private endpoint will be ready to use. We can go back and check the NIC&#8217;s private IP, or go directly to Databricks and ping the storage account URL to see if it now resolves to our private endpoint:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="796" height="184" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep12.png" alt="" class="wp-image-1249" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep12.png 796w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep12-300x69.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep12-768x178.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep12-260x60.png 260w" sizes="auto, (max-width: 796px) 100vw, 796px" /><figcaption class="wp-element-caption">As we can see now databricks resolves our storage account through the private endpoint</figcaption></figure>



<h2 class="wp-block-heading">Test the ADLS Gen2 SA endpoint from Databricks </h2>



<p>If we have credential passthrough enabled on our cluster and we have permission to write to the data lake, we should now be able to write there without going through the internet:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="175" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-1024x175.png" alt="" class="wp-image-1250" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-1024x175.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-300x51.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-768x132.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-1536x263.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-2048x351.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-350x60.png 350w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Writing to a DataLake through the Private Endpoint we just created</figcaption></figure>



<p>So this is the end of the tutorial. We created two private endpoints, one for the AzureSQL database and another for our data lake, and used both of them from Databricks. We also confirmed we are effectively using them by pinging the hostnames of both resources and seeing them resolve to the private IP instead of the public one.</p>



<p>Happy data projects!</p>
<p>The post <a href="https://www.albertnogues.com/using-azure-private-endpoints-with-databricks/">Using Azure Private Endpoints with Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Databricks and Spark Crash Course. Delta and More!</title>
		<link>https://www.albertnogues.com/databricks-and-spark-crash-course-delta-and-more/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=databricks-and-spark-crash-course-delta-and-more</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Thu, 25 Mar 2021 17:58:00 +0000</pubDate>
				<category><![CDATA[BigData]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[delta]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[scala]]></category>
		<category><![CDATA[spark]]></category>
		<category><![CDATA[sql]]></category>
		<guid isPermaLink="false">http://192.168.1.40/?p=1025</guid>

					<description><![CDATA[<p>I&#8217;ve been working on a Databricks and Delta tutorial for all of you. I published it as a notebook and you can grab it here. We will load some sample data from the NYC taxi dataset available in Databricks and store it as a table. We will then use Python to do some manipulation (Extract &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/databricks-and-spark-crash-course-delta-and-more/">Databricks and Spark Crash Course. Delta and More!</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>I&#8217;ve been working on a Databricks and Delta tutorial for all of you. I published it as a notebook and you can grab it <a rel="noreferrer noopener" href="https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/897686883903747/2503669437642038/312541189568512/latest.html" data-type="URL" data-id="https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/897686883903747/2503669437642038/312541189568512/latest.html" target="_blank">here</a>.</p>



<p>We will load some sample data from the NYC taxi dataset available in Databricks and store it as a table. We will then use Python to do some manipulation (extracting month and year from the trip time), which adds two new columns to our dataframe, and we will check how the file is saved in the Hive warehouse. We will observe that we have some junk data, because partitioning created folders for months and years we are not supposed to have, so we will filter out these bad records both the Python way and the SQL way.</p>
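


<p>A rough sketch of those steps (the table and column names, such as <code>tpep_pickup_datetime</code>, and the 2019 filter are assumptions based on the NYC taxi schema rather than the exact notebook code):</p>



<pre class="wp-block-code"><code>from pyspark.sql.functions import col, month, year

# Assumed source table name; in the notebook this comes from the NYC taxi sample data.
trips = spark.read.table("nyc_trips")

# Add the two partition columns derived from the pickup timestamp.
trips = (trips
         .withColumn("trip_year", year(col("tpep_pickup_datetime")))
         .withColumn("trip_month", month(col("tpep_pickup_datetime"))))

# Filter out junk records outside the expected period, the Python way.
clean_trips = (trips
               .filter(col("trip_year") == 2019)
               .filter(col("trip_month").between(1, 12)))

# The same filter, the SQL way.
trips.createOrReplaceTempView("trips")
clean_trips_sql = spark.sql(
    "SELECT * FROM trips WHERE trip_year = 2019 AND trip_month BETWEEN 1 AND 12"
)</code></pre>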



<p>Then, we will load another month of data as a temporary view and compare it with a delta table, where we can run updates and all sorts of DML.</p>
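


<p>For instance, once the data lives in a Delta table, an update is a one-liner (the table and column names here are only illustrative):</p>



<pre class="wp-block-code"><code># DML such as UPDATE works directly on Delta tables;
# a temporary view over plain Parquet files would not support this.
spark.sql("""
  UPDATE trips_delta
  SET passenger_count = 1
  WHERE passenger_count IS NULL
""")</code></pre>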



<p>As a last step, we will load some master data and perform a join. For more on Delta Lake you can follow this tutorial &#8211;&gt; <a href="https://delta.io/tutorials/delta-lake-workshop-primer/">https://delta.io/tutorials/delta-lake-workshop-primer/</a></p>
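


<p>A join against such a master table might look like this (the lookup table name and the <code>payment_type</code> join key are assumptions, not the notebook&#8217;s exact code):</p>



<pre class="wp-block-code"><code># Hypothetical lookup table with one row per payment type code.
payment_types = spark.read.table("payment_type_lookup")

# Enrich the trips (from the earlier sketch) with the payment type description.
enriched = clean_trips.join(payment_types, on="payment_type", how="left")
display(enriched.limit(10))</code></pre>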



<p>Enjoy coding!</p>
<p>The post <a href="https://www.albertnogues.com/databricks-and-spark-crash-course-delta-and-more/">Databricks and Spark Crash Course. Delta and More!</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
