<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Azure Archives - Albert Nogués</title>
	<atom:link href="https://www.albertnogues.com/category/cloud/azure/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.albertnogues.com/category/cloud/azure/</link>
	<description>Data and Cloud Freelancer</description>
	<lastBuildDate>Fri, 31 May 2024 11:17:08 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://www.albertnogues.com/wp-content/uploads/2020/12/cropped-cropped-AlbertLogo2-32x32.png</url>
	<title>Azure Archives - Albert Nogués</title>
	<link>https://www.albertnogues.com/category/cloud/azure/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Data Quality Checks with Soda-Core in Databricks</title>
		<link>https://www.albertnogues.com/data-quality-checks-with-soda-core-in-databricks/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=data-quality-checks-with-soda-core-in-databricks</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Fri, 31 May 2024 11:17:06 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[soda]]></category>
		<category><![CDATA[spark]]></category>
		<category><![CDATA[sql]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=3439</guid>

					<description><![CDATA[<p>It&#8217;s easy to run data quality checks when working with Spark thanks to the soda-core library, which supports Spark dataframes. I&#8217;ve tested it within a Databricks environment and it worked quite easily for me. For the examples in this article I am loading the customer table from the TPC-H delta tables in the &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/data-quality-checks-with-soda-core-in-databricks/">Data Quality Checks with Soda-Core in Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>It&#8217;s easy to run data quality checks when working with Spark thanks to the soda-core library, which supports Spark dataframes. I&#8217;ve tested it within a Databricks environment and it worked quite easily for me.</p>



<p>For the examples in this article I am loading the customer table from the TPC-H delta tables in the databricks-datasets folder.</p>



<p>First of all we need to install the library, either scoped to our Databricks notebook or on our cluster. In my case I will install it notebook-scoped:</p>



<pre class="wp-block-code"><code>%pip install soda-core-spark-df</code></pre>



<p>Then we create a dataframe from the TPC-H customer table:</p>



<pre class="wp-block-code"><code>#We create a table and read it into a dataframe
customer_df = spark.read.table("delta.`/databricks-datasets/tpch/delta-001/customer/`")</code></pre>



<p>We create a temporary view for our dataframe so soda can query the data and run the checks:</p>



<pre class="wp-block-code"><code>#We create a TempView
customer_df.createOrReplaceTempView("customer")</code></pre>



<p>And here comes the heart of soda-core: we define the checks using YAML syntax:</p>



<pre class="wp-block-code"><code>from soda.scan import Scan
scan = Scan()
scan.set_scan_definition_name("Databricks Test Notebook")
scan.set_data_source_name("customer")
scan.add_spark_session(spark, data_source_name="customer")
#YAML Format
checks = '''
checks for customer:
  - row_count > 0
  - invalid_percent(c_phone) = 0:
      valid regex: ^&#91;0-9]{2}&#91;-]&#91;0-9]{3}&#91;-]&#91;0-9]{3}&#91;-]&#91;0-9]{4}$
  - duplicate_count(c_phone) = 0:
      name: No duplicate phone numbers
  - invalid_count(c_mktsegment) = 0:
      invalid values: &#91;HOUSEHOLD]
      name: HOUSEHOLD is not allowed as a Market Segment
'''
# you can use add_sodacl_yaml_file(s). Useful if the tests are in a github repo or FS
scan.add_sodacl_yaml_str(checks)
scan.execute()
print(scan.get_logs_text())</code></pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img fetchpriority="high" decoding="async" width="570" height="322" src="https://www.albertnogues.com/wp-content/uploads/2024/05/Output1.png" alt="" class="wp-image-3440" srcset="https://www.albertnogues.com/wp-content/uploads/2024/05/Output1.png 570w, https://www.albertnogues.com/wp-content/uploads/2024/05/Output1-300x169.png 300w, https://www.albertnogues.com/wp-content/uploads/2024/05/Output1-106x60.png 106w" sizes="(max-width: 570px) 100vw, 570px" /></figure>
</div>
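
<p>If we want the notebook (or the job running it) to fail when a check fails, the Scan object exposes helpers for this. A minimal sketch, assuming the scan above has already executed:</p>

<pre class="wp-block-code"><code>#Optional: make the notebook fail if any check did not pass
if scan.has_check_fails():
    raise Exception("Soda checks failed:\n" + scan.get_logs_text())</code></pre>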


<p>More info: <a href="https://docs.soda.io/soda/quick-start-databricks.html">Add Soda to a Databricks notebook | Soda Documentation</a></p>



<p>List of validations: <a href="https://docs.soda.io/soda-cl/validity-metrics.html">Validity metrics | Soda Documentation</a> and <a href="https://docs.soda.io/soda-cl/metrics-and-checks.html">SodaCL metrics and checks | Soda Documentation</a></p>



<p>We can enhance this a bit further and generate a Spark dataframe out of the list of our check results, including warnings and errors:</p>



<pre class="wp-block-code"><code>from datetime import datetime
schema_checks = 'datasource STRING, table STRING, rule_name STRING, rule STRING, column STRING, check_status STRING, number_of_errors_in_sample INT, check_time TIMESTAMP'
list_of_checks = &#91;]
scan_results = scan.get_scan_results()
for c in scan_results&#91;'checks']:
    #Failing checks expose the number of failing rows in their diagnostics block
    failing_rows = 0 if 'pass' in c&#91;'outcome'] else int(c&#91;'diagnostics']&#91;'blocks']&#91;0]&#91;'totalFailingRows'])
    list_of_checks.append(&#91;scan_results&#91;'defaultDataSource'], c&#91;'table'], c&#91;'name'], c&#91;'definition'], c&#91;'column'], c&#91;'outcome'], failing_rows, datetime.strptime(scan_results&#91;'dataTimestamp'], '%Y-%m-%dT%H:%M:%S%z')])
list_of_checks_df = spark.createDataFrame(list_of_checks,schema_checks)
display(list_of_checks_df)</code></pre>



<figure class="wp-block-image size-large is-resized"><img decoding="async" width="1024" height="403" src="https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-1024x403.png" alt="" class="wp-image-3441" style="width:840px;height:auto" srcset="https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-1024x403.png 1024w, https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-300x118.png 300w, https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-768x302.png 768w, https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-152x60.png 152w, https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput.png 1328w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>In case we have the YAML file in our GitHub repo, we can read it and pass it in. Or, if we are working with Databricks Repos and the file is part of our repo, we can load it locally.</p>



<p>Accessing a remote file and reading it with requests:</p>



<pre class="wp-block-code"><code>#Trying to use a remote yaml file to enforce rules. We can upload it to a github of our own and use it in opur notebook.
#I've created a public repo so i dont need to authenticate to github, but in a real world scenario we should use private repo + secret scopes
customer_quality_rules = 'https://raw.githubusercontent.com/anogues/soda-core-quality-rules/main/soda-core-quality-rules-customer.yaml'
import requests
scan.add_sodacl_yaml_str(requests.get(customer_quality_rules).text)</code></pre>



<p>Or we can load it locally if we are using Databricks Repos:</p>



<pre class="wp-block-code"><code>scan.add_sodacl_yaml_file("your_file.yaml")</code></pre>
<p>The post <a href="https://www.albertnogues.com/data-quality-checks-with-soda-core-in-databricks/">Data Quality Checks with Soda-Core in Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Query Delta Tables in the DataLake from PowerBi with Databricks</title>
		<link>https://www.albertnogues.com/query-delta-tables-in-the-datalake-from-powerbi-with-databricks/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=query-delta-tables-in-the-datalake-from-powerbi-with-databricks</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Wed, 15 Nov 2023 18:47:00 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[powerbi]]></category>
		<category><![CDATA[spark]]></category>
		<category><![CDATA[sql]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=2195</guid>

					<description><![CDATA[<p>There are several ways to query Delta tables from Power BI. We are going to cover the fourth method here. To do it we first need a service principal, a Databricks secret scope pointing to an Azure Key Vault, and the password of the SPN stored in this Key Vault. Once we have this, the first step is to &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/query-delta-tables-in-the-datalake-from-powerbi-with-databricks/">Query Delta Tables in the DataLake from PowerBi with Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>There are several ways to query Delta tables from Power BI:</p>



<ul class="wp-block-list">
<li>You can use Snowflake with external stages reading Delta data from the data lake.</li>



<li>You can use the Parquet connector and query the data directly from the data lake (caution: this method does not support SPNs, and it only works with Delta tables that have a single version).</li>



<li>You can use the Delta Sharing connector (not currently possible at Danone due to not having Unity Catalog).</li>



<li>Or the recommended way: using Databricks (with an SPN + a Databricks cluster, either Data Engineering or SQL Warehouse).</li>
</ul>



<p>We are going to cover the fourth method here. To do it we first need a service principal, a Databricks secret scope pointing to an Azure Key Vault, and the password of the SPN stored in this Key Vault.</p>



<p>Once we have this, the first step is to set up the cluster with the credentials to access the data lake. For this we need to configure the Spark variables of our Databricks cluster. You can follow the guide <a href="https://learn.microsoft.com/en-us/azure/databricks/getting-started/connect-to-azure-storage">here</a>.</p>



<p>After that, your cluster should have the credentials in the Spark conf section, something like this:</p>



<figure class="wp-block-image size-large"><img decoding="async" width="677" height="1024" src="https://www.albertnogues.com/wp-content/uploads/2023/11/image-1-677x1024.png" alt="" class="wp-image-2197" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/image-1-677x1024.png 677w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-1-198x300.png 198w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-1-768x1161.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-1-40x60.png 40w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-1.png 979w" sizes="(max-width: 677px) 100vw, 677px" /></figure>



<p>The second step is creating an EXTERNAL table in Databricks pointing to our Delta table(s). For this we connect to our Databricks workspace with the previous configuration, create a new notebook, and define the external tables we want, something like this:</p>



<pre class="wp-block-code"><code>%sql
CREATE TABLE IF NOT EXISTS anogues.customers_external 
LOCATION 'abfss://raw@albertdatabricks001.dfs.core.windows.net/customers'</code></pre>



<p>Make sure the location is the right one; otherwise, when we query the data we will get either an error or no results.</p>
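
<p>A quick way to sanity-check the location from a notebook is to simply list it (same path as in the DDL above):</p>

<pre class="wp-block-code"><code>#List the table location; for a valid delta table we should see a _delta_log folder
display(dbutils.fs.ls("abfss://raw@albertdatabricks001.dfs.core.windows.net/customers"))</code></pre>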



<p>Once we have this we can query our external table and verify we can see the data:</p>



<pre class="wp-block-code"><code>%sql
select * from anogues.customers_external LIMIT 5;</code></pre>



<p>And provided we did it right, we should see the data. There is no need to define the table columns, as Delta uses Parquet under the hood: it&#8217;s a self-contained format where the schema is stored alongside the data.</p>






<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="531" src="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556-1024x531.png" alt="" class="wp-image-2198" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556-1024x531.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556-300x156.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556-768x398.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556-1536x796.png 1536w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556-116x60.png 116w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556.png 1682w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Once we&#8217;ve confirmed this is working, we can go to Power BI and import data using the Databricks connector:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="349" src="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142627-1024x349.png" alt="" class="wp-image-2199" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142627-1024x349.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142627-300x102.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142627-768x262.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142627-176x60.png 176w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142627.png 1214w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To configure the connector we need to get some details from our cluster. These can be found in the advanced options of our cluster, in the JDBC/ODBC tab:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="966" src="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142733-1024x966.png" alt="" class="wp-image-2200" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142733-1024x966.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142733-300x283.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142733-768x726.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142733-64x60.png 64w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142733.png 1228w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>On the following screen we need to select our authentication options to connect to the Databricks cluster. Since we have SAML + SCIM enabled in our workspaces, the username and password option is not possible: we either need a Databricks PAT token or Azure AD. I recommend the latter:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="507" src="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142917-1024x507.png" alt="" class="wp-image-2201" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142917-1024x507.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142917-300x149.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142917-768x381.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142917-121x60.png 121w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142917.png 1235w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>We click on it and select our AAD account. If all goes well our session will be started, and we should see it on the screen:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="506" src="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-143025-1024x506.png" alt="" class="wp-image-2202" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-143025-1024x506.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-143025-300x148.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-143025-768x379.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-143025-121x60.png 121w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-143025.png 1223w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then we click on Connect, and we can see our data. Since we don&#8217;t have Unity Catalog, our table should appear in the hive_metastore catalog. There we can find our database and, inside it, our table(s). We click and either load all the tables we want or start transforming them inside Power BI, just like with any other data source.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="867" height="689" src="https://www.albertnogues.com/wp-content/uploads/2023/11/pbi.png" alt="" class="wp-image-2208" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/pbi.png 867w, https://www.albertnogues.com/wp-content/uploads/2023/11/pbi-300x238.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/11/pbi-768x610.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/pbi-76x60.png 76w" sizes="auto, (max-width: 867px) 100vw, 867px" /></figure>



<p>For more help, here is the documentation for the Power BI Databricks connector: <a href="https://learn.microsoft.com/en-us/azure/databricks/partners/bi/power-bi#--connect-power-bi-desktop-to-azure-databricks-manually">Connect Power BI to Azure Databricks &#8211; Azure Databricks | Microsoft Learn</a></p>
<p>The post <a href="https://www.albertnogues.com/query-delta-tables-in-the-datalake-from-powerbi-with-databricks/">Query Delta Tables in the DataLake from PowerBi with Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Smallest Analytical Platform Ever!</title>
		<link>https://www.albertnogues.com/smallest-analytical-platform-ever/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=smallest-analytical-platform-ever</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Sat, 07 May 2022 08:38:12 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Cloud]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Spark]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[git]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[spark]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=1440</guid>

					<description><![CDATA[<p>I&#8217;ve started working in some of my free time on a project to build the smallest useful analytics platform on the cloud (starting with Azure). The purpose is to use it as a PoC to show to colleagues, managers, and prospective customers, or just to have fun and play. It&#8217;s publicly available on my GitHub repo &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/smallest-analytical-platform-ever/">Smallest Analytical Platform Ever!</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>I&#8217;ve started working in some of my free time on a project to build the smallest useful analytics platform on the cloud (starting with Azure).</p>



<p>The purpose is to use it as a PoC to show to colleagues, managers, and prospective customers, or just to have fun and play.</p>



<p>It&#8217;s publicly available on my GitHub repo and any collaboration is welcome. You can fork it, improve it, send PRs and do whatever you want!</p>



<p>The first version will run solely on Azure. The objective is to show the following technologies/disciplines:</p>



<p>* Infrastructure as Code (IaC), by using Terraform</p>



<p>* Cloud architecture and Cloud Ops, by using an Azure cloud environment</p>



<p>* Data Engineering, by using a Spark-powered Databricks notebook and an ADF pipeline (future)</p>



<p>* DevOps to trigger some pipelines based on changes (future)</p>



<p>* Basic security concepts (Key Vault, service principals, least-privilege RBAC access&#8230;)</p>



<p>* FinOps, keeping the costs at a minimum and choosing the proper tools for the job</p>



<p>* Reporting and Dashboarding on data in the platform</p>



<p>* Data management: we will use an ADLS storage account and an Azure SQL DB</p>



<p>TOOLS:</p>



<p>* Terraform to deploy all the infra as code</p>



<p>* Azure Cloud to host our resources</p>



<p>You have the code plus all the information on my github repo:</p>



<p><a href="https://github.com/anogues/ProjectZ">https://github.com/anogues/ProjectZ</a></p>



<p>The post <a href="https://www.albertnogues.com/smallest-analytical-platform-ever/">Smallest Analytical Platform Ever!</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Implementing CI/CD in Databricks with Azure DevOps (Part 1)</title>
		<link>https://www.albertnogues.com/implementing-ci-cd-in-databricks-with-azure-devops-part-1/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=implementing-ci-cd-in-databricks-with-azure-devops-part-1</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Sat, 30 Apr 2022 15:10:29 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Spark]]></category>
		<category><![CDATA[Cloud]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[git]]></category>
		<category><![CDATA[spark]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=1416</guid>

					<description><![CDATA[<p>There are many ways to implement CI/CD with Databricks. We can use Azure DevOps, GitHub + GitHub Actions, or any other combination of tools, including the dbx tool. But a simple way to just copy notebooks between workspaces can be implemented with Azure DevOps. We are going to use the git repos capability of Azure &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/implementing-ci-cd-in-databricks-with-azure-devops-part-1/">Implementing CI/CD in Databricks with Azure DevOps (Part 1)</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>There are many ways to implement CI/CD with Databricks. We can use Azure DevOps, GitHub + GitHub Actions, or any other combination of tools, including the <a href="https://dbx.readthedocs.io/en/latest/templates/python_basic.html#project-file-structure" target="_blank" rel="noreferrer noopener">dbx tool</a>.</p>



<p>But a simple way to just copy notebooks between workspaces can be implemented with Azure DevOps.</p>



<p>We are going to use the git repos capability of Azure Databricks: when a new code change is committed in a notebook, an Azure DevOps pipeline will copy the notebook from the first Databricks workspace (in our case a NonProd workspace) to the target one; in our case, the Prod workspace.</p>



<p>To achieve this we will use some more components from the Azure ecosystem, including Key Vaults to keep all our secrets stored safely. The list of prerequisites is the following:</p>



<ul class="wp-block-list"><li>Two Databricks workspaces, one our source workspace (NonProd) and another, our Production one.</li><li>An Azure Keyvault (or two if we want to segregate the environments)</li><li>Azure Databricks repository configured at least in our source workspace, so when the change is commited we can triger the pipeline that will fetch the notebook and transport it to the prod workspace</li><li>Access to Azure DevOps (Something similar can be implemented with Github + Github Actions)</li></ul>



<p>Let&#8217;s see how to implement it. First we need to make sure git repos is enabled in our source workspace. We can verify by logging in to our workspace with an admin-privileged user and making sure the option is checked as follows:</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD1-924x1024.png" alt="" class="wp-image-1418" width="687" height="760" srcset="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD1-924x1024.png 924w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD1-271x300.png 271w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD1-768x851.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD1-54x60.png 54w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD1.png 1072w" sizes="auto, (max-width: 687px) 100vw, 687px" /><figcaption>Fig 1. Make sure that github repos is enabled in our workspace.</figcaption></figure>



<p>Secondly, we go to <a href="https://azure.microsoft.com/en-us/services/devops/" target="_blank" rel="noreferrer noopener">Azure DevOps services</a> and create a new project. I&#8217;ve called it DatabricksCICD but feel free to call it whatever you need:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="714" src="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD2-1024x714.png" alt="" class="wp-image-1419" srcset="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD2-1024x714.png 1024w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD2-300x209.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD2-768x536.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD2-86x60.png 86w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD2.png 1531w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>Fig 2. Create a new repo and initialize it</figcaption></figure>



<p>Once created, we grab the clone details of our repo and copy them. We go to our Databricks workspace, look for the Repos option on the left, and add a new repository. We need to paste the URL to clone our newly created Azure DevOps repository:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="319" src="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-1024x319.png" alt="" class="wp-image-1420" srcset="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-1024x319.png 1024w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-300x93.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-768x239.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-1536x478.png 1536w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-2048x637.png 2048w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-193x60.png 193w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>Fig 3. Cloning our Azure DevOps Repository</figcaption></figure>



<p>Once we have linked our Databricks workspace with our DevOps repo, we can create a new notebook. In the same Repos section, click on the down arrow to create a new notebook, as shown below:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="889" height="373" src="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD4.png" alt="" class="wp-image-1421" srcset="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD4.png 889w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD4-300x126.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD4-768x322.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD4-143x60.png 143w" sizes="auto, (max-width: 889px) 100vw, 889px" /><figcaption>Fig 4.  Creating a new notebook.</figcaption></figure>



<p>As for the content of the notebook, you can put anything you want. I&#8217;m writing a print(&#8220;Hello from Albert&#8221;) statement. We will not run it; we just want to show it&#8217;s possible to transport it. Once done, click on Save Now in the revision tab:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="73" src="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-1024x73.png" alt="" class="wp-image-1423" srcset="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-1024x73.png 1024w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-300x21.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-768x55.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-1536x110.png 1536w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-2048x146.png 2048w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-600x43.png 600w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>Fig 5. Saving our changes to the notebook.</figcaption></figure>



<p>Then click on the main branch button on the left; from there we will commit the changes to our repository:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="314" src="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-1024x314.png" alt="" class="wp-image-1424" srcset="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-1024x314.png 1024w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-300x92.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-768x235.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-1536x470.png 1536w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-2048x627.png 2048w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-196x60.png 196w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>Fig 6. Pushing our notebook to the DevOps repo</figcaption></figure>



<p>If we go back now to our Azure DevOps project we should see the file has been committed to the repository. This ends the first part of this tutorial.</p>



<p>In the second blog entry we will see how to trigger the pipeline after a modification of this notebook, and how to pass the credentials of the second workspace to deliver the changed notebook to our Production (target) workspace.</p>
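
<p>As a preview of what that pipeline will essentially do, here is a rough sketch using the Databricks workspace REST API (the hostnames, tokens and notebook path below are placeholders; in the real pipeline the tokens will come from the Keyvault):</p>

<pre class="wp-block-code"><code>#Rough sketch: export a notebook from the NonProd workspace and import it into Prod
#using the workspace REST API. All hostnames/tokens/paths here are placeholders.
import requests

SOURCE_HOST, SOURCE_TOKEN = "https://&lt;nonprod-workspace&gt;.azuredatabricks.net", "&lt;nonprod-token&gt;"
TARGET_HOST, TARGET_TOKEN = "https://&lt;prod-workspace&gt;.azuredatabricks.net", "&lt;prod-token&gt;"
NOTEBOOK_PATH = "/Repos/DatabricksCICD/hello"

#Export the notebook source from NonProd (the API returns it base64-encoded)
resp = requests.get(f"{SOURCE_HOST}/api/2.0/workspace/export",
                    headers={"Authorization": f"Bearer {SOURCE_TOKEN}"},
                    params={"path": NOTEBOOK_PATH, "format": "SOURCE"})
resp.raise_for_status()
content = resp.json()&#91;"content"]

#Import it (overwriting) into the Prod workspace
resp = requests.post(f"{TARGET_HOST}/api/2.0/workspace/import",
                     headers={"Authorization": f"Bearer {TARGET_TOKEN}"},
                     json={"path": NOTEBOOK_PATH, "format": "SOURCE",
                           "language": "PYTHON", "content": content, "overwrite": True})
resp.raise_for_status()</code></pre>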
<p>The post <a href="https://www.albertnogues.com/implementing-ci-cd-in-databricks-with-azure-devops-part-1/">Implementing CI/CD in Databricks with Azure DevOps (Part 1)</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Using Azure Private Endpoints with Databricks</title>
		<link>https://www.albertnogues.com/using-azure-private-endpoints-with-databricks/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=using-azure-private-endpoints-with-databricks</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Thu, 09 Dec 2021 19:31:32 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Spark]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[Cloud]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[PrivateEndpoints]]></category>
		<category><![CDATA[spark]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=1235</guid>

					<description><![CDATA[<p>In this article I will show how to avoid going out to the internet when using resources inside Azure, especially if they are in the same subscription and location (datacenter). Why might we want a private endpoint? That&#8217;s a good question: for both security and performance. We don&#8217;t want the traffic going out to the internet just to &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/using-azure-private-endpoints-with-databricks/">Using Azure Private Endpoints with Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>In this article I will show how to avoid going out to the internet when using resources inside Azure, especially if they are in the same subscription and location (datacenter).</p>



<p>Why might we want a private endpoint? That&#8217;s a good question: for both security and performance. We don&#8217;t want the traffic going out to the internet just to come back to the Azure datacenter when the resource we are trying to reach is already there. With a Private Link, the traffic stays inside the Azure backbone network, never reaching the internet. More information about private endpoints <a href="https://azure.microsoft.com/en-us/services/private-link/" target="_blank" rel="noreferrer noopener">here</a> and <a href="https://docs.microsoft.com/en-us/azure/private-link/private-link-overview" target="_blank" rel="noreferrer noopener">here</a>.</p>



<p>Though it&#8217;s possible to create private endpoints to connect to services in other subscriptions, we will use the same subscription and the West Europe region in this article. The goal is to connect to both an AzureSQL database and a data lake using private connectivity.</p>



<h2 class="wp-block-heading">Creating a Private Endpoint for AzureSQL and integrating in the databricks vnet</h2>



<p>For this, I created a Databricks workspace and chose to use an already existing VNET, so I can add a new subnet for my private endpoints. One of the good things about doing it this way is that NICs in different subnets see each other and are reachable (unless we block it with a network security group); by default, traffic is open within the VNET. So I can create a private endpoint in a specific subnet of the same VNET that hosts the Databricks subnets.</p>



<p>Bear in mind that it&#8217;s not possible to add a private endpoint to a subnet managed by Databricks. So the two subnets created when we deployed the Databricks workspace (both public and private) should not be modified. We will create a new one as shown in the screen below:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="145" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-1024x145.png" alt="" class="wp-image-1236" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-1024x145.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-300x43.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-768x109.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-1536x218.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-2048x290.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-424x60.png 424w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Our  Databricks VNET. Among the two subnets created when the databricks workspace is created i added a new one to host our Private Endpoints</figcaption></figure>



<p>Once the VNET is properly defined, we can create the private endpoint to reach our AzureSQL server through it.</p>



<p>First we go to the Azure Portal, find our AzureSQL server, open the left menu entry called Private endpoint connections, and click on the plus sign on top to create a new one. We just need to select the subscription, the resource group, the name of the private endpoint, and the region. We can fill it in as shown in the following picture:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="946" height="626" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep1.png" alt="" class="wp-image-1237" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep1.png 946w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep1-300x199.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep1-768x508.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep1-91x60.png 91w" sizes="auto, (max-width: 946px) 100vw, 946px" /><figcaption class="wp-element-caption">Private Endpoint Creation. Step 1</figcaption></figure>



<p>The second step requires a bit more information: here we define which resource our private endpoint targets. As expected, we need to find our AzureSQL server here. We fill in the combo boxes as usual.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="920" height="529" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep2.png" alt="" class="wp-image-1238" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep2.png 920w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep2-300x173.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep2-768x442.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep2-330x190.png 330w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep2-104x60.png 104w" sizes="auto, (max-width: 920px) 100vw, 920px" /><figcaption class="wp-element-caption">Private Endpoint Creation. Step 2</figcaption></figure>



<p>The third screen is the most important one. We need to select the VNET and subnet that will host our private endpoint. In this case we want to use Databricks, so we select the VNET we created for Databricks, and then the subnet we created specifically to host the private endpoints.</p>



<p>Another important step here is to integrate it with the DNS. If we don&#8217;t integrate it, when we use the AzureSQL hostname provided by Azure we will still go through the public endpoint. By integrating it with the DNS, queries for the public hostname in this private zone will resolve to the private IP of the private endpoint&#8217;s NIC.</p>



<p>If we choose No for the DNS integration, we will have to add static entries in /etc/hosts or the like, or use the private IP instead of the hostname when connecting to the AzureSQL server. To simplify, we choose to integrate it.</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="583" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep3-1024x583.png" alt="" class="wp-image-1239" style="width:840px;height:478px" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep3-1024x583.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep3-300x171.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep3-768x438.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep3-105x60.png 105w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep3.png 1241w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"> Private Endpoint Creation. Step 3.</figcaption></figure>



<p>Once created, we should see the private endpoint available. If you look at the right, it&#8217;s implemented through a NIC (network interface card); by clicking on it, we can find the IP address assigned:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="177" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-1024x177.png" alt="" class="wp-image-1241" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-1024x177.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-300x52.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-768x132.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-1536x265.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-2048x353.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-348x60.png 348w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Our newly created Private Endpoint for Azure SQL</figcaption></figure>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="225" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-1024x225.png" alt="" class="wp-image-1242" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-1024x225.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-300x66.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-768x168.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-1536x337.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-2048x449.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-274x60.png 274w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Finding the PrivateIP Address of the NIC that implements the private endpoint</figcaption></figure>



<h2 class="wp-block-heading">Test the AzureSQL DB Endpoint from Databricks</h2>



<p>Now we have it ready. From outside that VNET, our server still resolves to a public IP, just as it did inside Databricks before the private endpoint. We can ping it for testing purposes:</p>



<pre class="wp-block-code"><code>C:\Users\Albert&gt;ping azure-sql-server-albert.database.windows.net

Pinging cr4.westeurope1-a.control.database.windows.net &#91;<strong>104.40.168.105</strong>] with 32 bytes of data:</code></pre>



<p>As you can see we get a public IP, but let&#8217;s try to ping it from inside the cluster:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="885" height="178" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep7.png" alt="" class="wp-image-1244" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep7.png 885w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep7-300x60.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep7-768x154.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep7-298x60.png 298w" sizes="auto, (max-width: 885px) 100vw, 885px" /><figcaption class="wp-element-caption">Private endpoint with the DNS integration working fine. Our dns record for the AzureSQL Db does not resolve to a public ip anymore but to the private IP of the PrivateEndpoint</figcaption></figure>



<p>So it&#8217;s working: it&#8217;s using the private IP instead of the public one. Our last step is to see if we can fetch the data from the database:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="363" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-1024x363.png" alt="" class="wp-image-1245" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-1024x363.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-300x106.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-768x272.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-1536x544.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-2048x726.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-169x60.png 169w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Accessing AzureSQL Database though a private endpoint from databricks</figcaption></figure>



<h2 class="wp-block-heading">Creating an AzureDataLake PrivateEndpoint and saving our data to the DataLake through it.</h2>



<p>We are not done yet! We can take it a step further and create a private endpoint to save data to our data lake as well.</p>



<p>I&#8217;ve created an ADLS Gen2 storage account, and going back to Databricks I see that by default it&#8217;s using public access:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="836" height="194" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep9.png" alt="" class="wp-image-1246" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep9.png 836w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep9-300x70.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep9-768x178.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep9-259x60.png 259w" sizes="auto, (max-width: 836px) 100vw, 836px" /><figcaption class="wp-element-caption">Datalake public access</figcaption></figure>



<p>But we can implement a private endpoint here as well, and route all the traffic through the Azure datacenter itself. Let&#8217;s see how to do it. We go to our ADLS Gen2 storage account, click again on Networking on the left, and open the second tab, called Private endpoint connections. We click the plus button to create a new one and basically follow the same steps as before, with a subtle difference.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="930" height="760" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep10.png" alt="" class="wp-image-1247" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep10.png 930w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep10-300x245.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep10-768x628.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep10-73x60.png 73w" sizes="auto, (max-width: 930px) 100vw, 930px" /><figcaption class="wp-element-caption">Creation of a private endpoint for an ADLS Gen2 storage account.</figcaption></figure>



<p>The difference with a storage account is that we need to choose which API we want to create the private endpoint for. We can use the blob, table, queue, file share, or dfs (data lake) endpoint (and also the static website!).</p>



<p>We will use the dfs endpoint, and again we will place it in the private endpoint subnet of our Databricks VNET. Something like this:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="814" height="990" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep11.png" alt="" class="wp-image-1248" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep11.png 814w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep11-247x300.png 247w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep11-768x934.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep11-49x60.png 49w" sizes="auto, (max-width: 814px) 100vw, 814px" /><figcaption class="wp-element-caption">Creating a Private Endpoint for our DataLake an dplacing it in the appropiate subnet</figcaption></figure>



<p>After a few minutes our private endpoint will be ready to use. We can again look at the NIC to check the private IP, or go directly to Databricks and ping the storage account URL to see if it now resolves to our private endpoint:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="796" height="184" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep12.png" alt="" class="wp-image-1249" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep12.png 796w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep12-300x69.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep12-768x178.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep12-260x60.png 260w" sizes="auto, (max-width: 796px) 100vw, 796px" /><figcaption class="wp-element-caption">As we can see now databricks resolves our storage account through the private endpoint</figcaption></figure>



<h2 class="wp-block-heading">Test the ADLS Gen2 SA endpoint from Databricks </h2>



<p>If we have credential passthrough enabled in our cluster and permission to write to the data lake, we should now be able to write there without going through the internet:</p>
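
<p>Something along these lines (placeholder container and storage account) should now travel over the private endpoint:</p>

<pre class="wp-block-code"><code>#Minimal sketch (placeholder container/account): with credential passthrough enabled,
#this write goes through the private endpoint instead of the public internet
df = spark.range(10)
df.write.format("delta").mode("overwrite").save("abfss://&lt;container&gt;@&lt;storageaccount&gt;.dfs.core.windows.net/pep_test")</code></pre>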



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="175" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-1024x175.png" alt="" class="wp-image-1250" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-1024x175.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-300x51.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-768x132.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-1536x263.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-2048x351.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-350x60.png 350w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Writing to a DataLake through the Private Endpoint we just created</figcaption></figure>



<p>So this is the end of the tutorial. We created two private endpoints, one for the AzureSQL database and another for our data lake, and used both of them from Databricks. We also confirmed we are effectively using them by pinging the hostnames of both resources and seeing the change from the public IP to the private one.</p>



<p>Happy data projects!</p>
<p>The post <a href="https://www.albertnogues.com/using-azure-private-endpoints-with-databricks/">Using Azure Private Endpoints with Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Databricks connectivity to Azure SQL / SQL Server</title>
		<link>https://www.albertnogues.com/databricks-connectivity-to-azure-sql-sql-server/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=databricks-connectivity-to-azure-sql-sql-server</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Thu, 09 Dec 2021 10:45:34 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Spark]]></category>
		<category><![CDATA[SQL]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=1224</guid>

					<description><![CDATA[<p>Most of the developments I see inside Databricks rely on fetching data from or writing data to some sort of database. Usually the preferred method for this is through a JDBC driver, as most databases offer one. In some cases, though, it&#8217;s also possible to use a Spark-optimized driver. This &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/databricks-connectivity-to-azure-sql-sql-server/">Databricks connectivity to Azure SQL / SQL Server</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Most of the developments I see inside Databricks rely on fetching data from or writing data to some sort of database.</p>



<p>Usually the preferred method for this is through a JDBC driver, as most databases offer one.</p>



<p>In some cases, though, it&#8217;s also possible to use a Spark-optimized driver. This is the case for Azure SQL / SQL Server. We still have the option to use the standard JDBC driver (what most people do, because it&#8217;s common to all databases), but we can improve performance by using the specific Spark driver. Until some time ago it was only supported with the Scala API, but now it can be used from Python and R as well, so there is no reason not to give it a try.</p>



<p>In this article we will see both connectivity options. For testing purposes we will connect to an Azure SQL database in the same region (West Europe).</p>



<h2 class="wp-block-heading">Connecting to AzureSQL through  jdbc driver. </h2>



<p>In this case the JDBC driver already ships with the Databricks cluster, so we do not need to install anything; we can just connect directly. Let&#8217;s see how (there is a Scala example <a href="https://docs.microsoft.com/es-es/azure/databricks/data/data-sources/sql-databases" target="_blank" rel="noreferrer noopener">here</a>, but I will use Python):</p>



<pre class="wp-block-code"><code>#In a real development this should be fetched from a keyvault using a secret scope with: dbutils.secrets.get(scope = "sql_db", key = "username") and  dbutils.secrets.get(scope = "sql_db", key = "password")

jdbcDF = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://azure-sql-server-albert.database.windows.net:1433;databaseName=databricksdata") \
    .option("dbtable", "SalesLT.Product") \
    .option("user", "anogues") \
    .option("password", "XXXXXX") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()

jdbcDF.show()</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="362" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-1024x362.png" alt="" class="wp-image-1226" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-1024x362.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-300x106.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-768x272.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-1536x543.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-2048x724.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-170x60.png 170w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>Spark Dataframe from a JDBC Azure SQL DB Source</figcaption></figure>



<p>So, as we saw, we were able to connect successfully to our Azure SQL DB using the JDBC driver shipped with Databricks. Let&#8217;s now switch to the Spark-optimized driver.</p>



<h2 class="wp-block-heading">Connecting to AzureSQL through the spark optimized driver</h2>



<p>To connect using the Spark-optimized driver, we first need to install it in the cluster, as it&#8217;s not available by default.</p>



<p>The driver is available in Maven for both Spark 2.x and 3.x. On the Microsoft <a href="https://docs.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver15">website</a> we can find more information on where to get it and how to use it. For the purposes of this exercise we will install it through Databricks libraries, using Maven. Just add the following in the coordinates box: com.microsoft.azure:spark-mssql-connector_2.12:1.2.0, as can be seen in the image below.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="323" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2-1024x323.png" alt="" class="wp-image-1227" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2-1024x323.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2-300x95.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2-768x242.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2-1536x485.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2-190x60.png 190w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2.png 2006w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>Installing the spark AzureSQL Driver from Maven</figcaption></figure>



<p>Once installed we should see a green dot next to the driver, which means it is ready to be used. We go back to our notebook and try again:</p>



<pre class="wp-block-code"><code>#In a real development this should be fetched from a keyvault using a secret scope with: dbutils.secrets.get(scope = "sql_db", key = "username") and  dbutils.secrets.get(scope = "sql_db", key = "password")
jdbcDF = spark.read.format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", f"jdbc:sqlserver://azure-sql-server-albert.database.windows.net:1433;databaseName=databricksdata") \
    .option("dbtable", "SalesLT.Product") \
    .option("user", "anogues") \
    .option("password", "XXXXXX") \
    .load()

jdbcDF.show()</code></pre>



<p>If we see an error like java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.spark, it means the driver can&#8217;t be found, so it is probably not properly installed. Go back to the cluster&#8217;s libraries page and make sure the status is &#8220;Installed&#8221;. If all goes well we should see our dataframe again:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="351" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-1024x351.png" alt="" class="wp-image-1229" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-1024x351.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-300x103.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-768x263.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-1536x526.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-2048x701.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-175x60.png 175w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption> Spark Dataframe from a Spark Azure SQL DB Source </figcaption></figure>



<p>The main reason to use the optimized Spark driver is performance: Microsoft claims it is about 15x faster than the generic JDBC one. But there is more: the Spark driver also allows AAD authentication, either with a service principal or an AAD user account, in addition of course to native SQL Server authentication. Let&#8217;s check whether it works with an AAD account:</p>



<pre class="wp-block-code"><code>jdbcDF = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", f"jdbc:sqlserver://azure-sql-server-albert.database.windows.net:1433;databaseName=databricksdata") \
    .option("dbtable", "SalesLT.Product") \
    .option("authentication", "ActiveDirectoryPassword") \
    .option("user", "sqluser@anogues4hotmail.onmicrosoft.com") \
    .option("password", "XXXXXX") \
    .option("encrypt", "true") \
    .option("hostNameInCertificate", "*.database.windows.net") \
    .load()
jdbcDF.show()</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="366" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-1024x366.png" alt="" class="wp-image-1230" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-1024x366.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-300x107.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-768x274.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-1536x549.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-2048x732.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-168x60.png 168w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To use a service principal you need to generate an access token. In Python this can be accomplished with the <a href="https://pypi.org/project/adal/" target="_blank" rel="noreferrer noopener">adal</a> library (which needs to be installed on the cluster as well, from PyPI). There is a sample notebook in the Microsoft Spark driver GitHub account <a href="https://github.com/microsoft/sql-spark-connector/tree/master/samples/Databricks-AzureSQL/DatabricksNotebooks">here</a>.</p>
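


<p>As a rough sketch of how that flow could look (the tenant id, app id and secret placeholders below are mine, and in a real notebook they should again come from a secret scope):</p>



<pre class="wp-block-code"><code>import adal

# Assumed placeholders: replace with your own tenant and service principal values
authority = "https://login.microsoftonline.com/&lt;tenant-id&gt;"
resource = "https://database.windows.net/"

# Acquire an AAD token for the database resource using the service principal
context = adal.AuthenticationContext(authority)
token = context.acquire_token_with_client_credentials(resource, "&lt;sp-app-id&gt;", "&lt;sp-secret&gt;")

# Pass the token to the Spark connector instead of a user/password pair
jdbcDF = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", "jdbc:sqlserver://azure-sql-server-albert.database.windows.net:1433;databaseName=databricksdata") \
    .option("dbtable", "SalesLT.Product") \
    .option("accessToken", token&#91;"accessToken"]) \
    .option("encrypt", "true") \
    .option("hostNameInCertificate", "*.database.windows.net") \
    .load()
jdbcDF.show()</code></pre>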



<p>More information about the driver can be found in the Microsoft GitHub repository <a href="https://github.com/microsoft/sql-spark-connector" target="_blank" rel="noreferrer noopener">here</a>.</p>
<p>The post <a href="https://www.albertnogues.com/databricks-connectivity-to-azure-sql-sql-server/">Databricks connectivity to Azure SQL / SQL Server</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Ansible playbook to configure Azure Red Hat VM&#8217;s</title>
		<link>https://www.albertnogues.com/ansible-playbook-to-configure-azure-red-hat-vms/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ansible-playbook-to-configure-azure-red-hat-vms</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Wed, 21 Apr 2021 10:43:36 +0000</pubDate>
				<category><![CDATA[Ansible]]></category>
		<category><![CDATA[Azure]]></category>
		<category><![CDATA[Cloud]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[ansible]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[lun]]></category>
		<category><![CDATA[machine]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[redhat]]></category>
		<category><![CDATA[ulimit]]></category>
		<category><![CDATA[vm]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=1067</guid>

<description><![CDATA[<p>In today&#8217;s post I am going to share an Ansible playbook to configure a newly launched VM. This playbook contains the following: Change Admin password Create linux Group Add user to several groups Create a new user with specific salted password (Check point 3 for generating the hashed salt) Find all the mounted disks &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/ansible-playbook-to-configure-azure-red-hat-vms/">Ansible playbook to configure Azure Red Hat VM&#8217;s</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>In today&#8217;s post I am going to share an Ansible playbook to configure a newly launched VM. This playbook does the following:</p>



<ul class="wp-block-list"><li>Change Admin password</li><li>Create linux Group</li><li>Add user to several groups</li><li>Create a new user with specific salted password (Check point 3 for generating the hashed salt)</li><li>Find all the mounted disks in any LUN and format them (create a xfs filesystem)</li><li>Create a mountpoint (in /opt)</li><li>Mount the disk formatted in LUN 0 PART 1 in the specified mountoint in fstab, so it will be persistent across reboots</li><li>Register the OS into RHEL Satellite so yum is usable</li><li>Disable some local repo in an external /mnt drive that crashes yum in the image</li><li>Install Telnet and other packages</li><li>Modify some Ulimits</li><li>Register Machine in a domain (TO BE DONE)</li></ul>



<p>To start, let&#8217;s first create the Ansible inventory file and call it inventory.yaml (or use INI format if you prefer). The file should look something like the following:</p>



<pre class="wp-block-code"><code>all:
  children:
    myservers:
      hosts:
        SERVER001:
          ansible_host: 10.10.1.1
        SERVER002:
          ansible_host: 10.10.1.2
        SERVER003:
          ansible_host: 10.10.1.3
        SERVER004:
          ansible_host: 10.10.1.4
        SERVER005:
          ansible_host: 10.10.1.5
    alberttest:
      hosts:
        testalbert001:
          ansible_host: 10.10.1.6</code></pre>



<p>Then we can create our playbook file (I will comment on it below). This will be another YAML file with the following content:</p>



<pre class="wp-block-code"><code>---
- hosts: myservers
  collections:
    - ansible.posix
  tasks:
  - name: Ping the Server
    ping:

  - name: Change Password of the Admin user.
    user:
      name: Admin
      # python3 -c 'import crypt; print (crypt.crypt("Passw0rd", "$1$SomeSalt$"))'
      password: $1$SomeSalt$C7s11A7tyf8OKOg0JoCYp/

  - name: Create anogues group
    group:
      name: anogues
      state: present

  - name: Create anogues user
    user:
      name: anogues
      password: $1$SomeSalt$C7s11A7tyf8OKOg0JoCYp/
      shell: /bin/bash
      groups: anogues, wheel
      append: yes

  - name: Create mountpoint
    file: path=/opt/data state=directory

  - name: Find all Luns
    find:
      paths: /dev/disk/azure/scsi1/
      file_type: link
      recurse: No
      patterns: "lun?"
    register: files_matched

  - name: Partition Disk to Max
    shell: "parted {{ item.path }} --script mklabel gpt mkpart xfspart xfs 0% 100%"
    args:
      executable: /bin/bash
    loop: "{{ files_matched.files|flatten(levels=1) }}"

  - name: Inform OS of partition table changes
    command: partprobe

  - name: find UUID of sdX1
    shell: |
      blkid -s UUID -o value $(readlink -f /dev/disk/azure/scsi1/lun0-part1)
    register: uuid

  - name: show real uuid
    debug:
      msg: "{{ uuid.stdout }}"

  - name: Mount disk drive in fstab
    mount:
      path: /opt/data
      src: 'UUID={{uuid.stdout}}'
      fstype: xfs
      opts: defaults,nofail
      dump: 1
      passno: 2
      state: mounted

  - name: Check disk Status
    shell: df -h | grep /dev/sd
    register: df2_status

  - name: Show mounted FS
    debug:
      msg: "{{ df2_status.stdout_lines }}"

  - name: Register RHEL
    redhat_subscription:
      state: present
      username: "{{ lookup('env', 'RHEL_USER') }}"
      password: "{{ lookup('env', 'RHEL_PASSWORD') }}"
      autosubscribe: yes

  - name: Disable Media Repo
    ini_file:
      dest: /etc/yum.repos.d/media.repo
      section: "{{item}}"
      option: enabled
      value: 0
    with_items:
      - LocalRepo_BaseOS
      - LocalRepo_AppStream

  - name: Install Telnet and other packages
    yum:
      name:
        - telnet
        - curl
        - zip
        - unzip
        - tar
        - wget
        - libcurl
      state: present

  - name: Add or modify nproc hard limit for the user anogues. Set 65k value.
    pam_limits:
      domain: anogues
      limit_type: hard
      limit_item: nproc
      value: 65000
    become: yes
    become_method: sudo
    become_user: root

  - name: Add or modify nproc soft limit for the user anogues. Set 65k value.
    pam_limits:
      domain: anogues
      limit_type: soft
      limit_item: nproc
      value: 65000
    become: yes
    become_method: sudo
    become_user: root</code></pre>



<p>Let&#8217;s have a look at how it works:</p>



<p>One of the playbook steps registers RHEL into Satellite. To avoid hardcoding the user and password in the playbook, these values are taken from environment variables, so please export them beforehand:</p>



<pre class="wp-block-code"><code>export RHEL_USER=
export RHEL_PASSWORD=</code></pre>



<p>You need to install ansible-posix first (tested with 1.2.0). Download it from here: https://galaxy.ansible.com/ansible/posix</p>



<pre class="wp-block-code"><code>ansible-galaxy install ansible-posix-1.2.0.tar.gz</code></pre>



<p>To create an encrypted user password with a salt you can run the following, then copy the resulting hash into the playbook for the user whose password you want to set:</p>



<pre class="wp-block-code"><code>python3 -c 'import crypt; print(crypt.crypt("YOUR_UNHASHED_PASSWORD", "$1$SomeSalt$"))'</code></pre>



<p>Azure attaches the data disks by LUN, starting at LUN 0 by default, so I am using the symlinks under /dev/disk/azure to check the disks that were added. This playbook will only automount the first disk. If you need more, simply modify the playbook and add as many LUNs as you need.</p>



<pre class="wp-block-code"><code># tree /dev/disk/azure
/dev/disk/azure
├── resource -&gt; ../../sdb
├── resource-part1 -&gt; ../../sdb1
├── root -&gt; ../../sda
├── root-part1 -&gt; ../../sda1
├── root-part14 -&gt; ../../sda14
├── root-part15 -&gt; ../../sda15
├── root-part2 -&gt; ../../sda2
└── scsi1
    ├── lun0 -&gt; ../../../sdc
    └── lun1 -&gt; ../../../sdd</code></pre>



<p>Here there are two disks attached on two LUNs, 0 and 1. No disk has been formatted yet, as we only see the sdc and sdd drives with no partitions. Once the first one is formatted, you will see a difference:</p>



<pre class="wp-block-code"><code>/dev/disk/azure
├── resource -&gt; ../../sdb
├── resource-part1 -&gt; ../../sdb1
├── root -&gt; ../../sda
├── root-part1 -&gt; ../../sda1
├── root-part14 -&gt; ../../sda14
├── root-part15 -&gt; ../../sda15
├── root-part2 -&gt; ../../sda2
└── scsi1
    ├── lun0 -&gt; ../../../sdc
    ├── lun0-part1 -&gt; ../../../sdc1
    └── lun1 -&gt; ../../../sdd</code></pre>



<p>To run it (make sure you have installed Ansible 2.9 from the RHEL repo):</p>



<pre class="wp-block-code"><code>ansible-playbook -i inventory.yaml -k playbook.yaml</code></pre>



<p>The playbook also formats all drives found in any LUN (all disks attached to the VM), but only mounts the first one, as we only defined one mountpoint.</p>



<p>If we need to modify it, we can include all the tasks that are necessary (read the blkid, add the entry in fstab) in another loop block, but first we have to create as many directories for the mountpoints as necessary and feed them into the loop block, so they are assigned consistently.</p>



<p>Links of interest:</p>



<p><a href="https://stackoverflow.com/questions/19292899/creating-a-new-user-and-password-with-ansible">https://stackoverflow.com/questions/19292899/creating-a-new-user-and-password-with-ansible</a></p>



<p><a href="https://stackoverflow.com/questions/49424967/how-to-create-azure-vm-with-data-disk-and-then-format-it-via-ansible">https://stackoverflow.com/questions/49424967/how-to-create-azure-vm-with-data-disk-and-then-format-it-via-ansible</a></p>
<p>The post <a href="https://www.albertnogues.com/ansible-playbook-to-configure-azure-red-hat-vms/">Ansible playbook to configure Azure Red Hat VM&#8217;s</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Load data from azure blob storage and run TPC-DS queries on Azure Synapse.</title>
		<link>https://www.albertnogues.com/load-data-from-azure-blob-storage-and-run-tpc-ds-queries-on-azure-synapse/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=load-data-from-azure-blob-storage-and-run-tpc-ds-queries-on-azure-synapse</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Wed, 03 Feb 2021 15:35:20 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[BigData]]></category>
		<category><![CDATA[Cloud]]></category>
		<category><![CDATA[SQL]]></category>
		<guid isPermaLink="false">http://www.albertnogues.com.preview.services/?p=1023</guid>

<description><![CDATA[<p>In this article we will see how to provision an Azure Synapse cluster, load a large quantity of data from Azure Blob Storage and run a query to see the contents and check performance. I plan to write a series of articles around data warehousing in the cloud so check out for new articles soon. &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/load-data-from-azure-blob-storage-and-run-tpc-ds-queries-on-azure-synapse/">Load data from azure blob storage and run TPC-DS queries on Azure Synapse.</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>In this article we will see how to provision an Azure Synapse cluster, load a large quantity of data from Azure Blob Storage and run a query to see the contents and check performance. I plan to write a series of articles around data warehousing in the cloud, so check back for new articles soon.</p>



<p>I’ve split the article into 3 steps that cover diverse topics:</p>



<ul class="wp-block-list"><li>Part 1. Deploying a synapse cluster.</li><li>Part 2. Load TPC-DS data.</li><li>Part 3. Run queries to verify the performance.</li></ul>



<ol class="wp-block-list"><li>We need to create a synapse cluster. We head to the azure portal and we will create a Gen2: DW100c cluster, which is the cheapest option on sale for a bit more than 1.50 USD per hour. For this exercise I didnt create a synapse workspace, i just went with the &#8220;Dedicated SQL pool&#8221; because i am only interested in the synapse db warehouse, not spark or other engines this time. Check <a href="https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-overview-what-is" target="_blank" rel="noreferrer noopener">the following</a> documentation for more help. </li></ol>



<p>After a few minutes we will have the synapse warehouse ready.</p>



<p>Once created, we can take the URL and connect from DBeaver (or any other editor; you can use the free <a href="https://docs.microsoft.com/en-us/sql/azure-data-studio/download-azure-data-studio?view=sql-server-ver15" data-type="URL" data-id="https://docs.microsoft.com/en-us/sql/azure-data-studio/download-azure-data-studio?view=sql-server-ver15" target="_blank" rel="noreferrer noopener">Azure Data Studio too!</a>) to see if all is ok:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="555" height="625" src="https://www.albertnogues.com/wp-content/uploads/2021/02/Synapse1.png" alt="" class="wp-image-1052" srcset="https://www.albertnogues.com/wp-content/uploads/2021/02/Synapse1.png 555w, https://www.albertnogues.com/wp-content/uploads/2021/02/Synapse1-266x300.png 266w, https://www.albertnogues.com/wp-content/uploads/2021/02/Synapse1-53x60.png 53w" sizes="auto, (max-width: 555px) 100vw, 555px" /><figcaption>Check connectivity with the newly created synapse instance</figcaption></figure>



<p>If you have trouble, make sure your IP is added to the whitelist in the firewall section of your Synapse instance and that it is open to the public:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="638" height="498" src="https://www.albertnogues.com/wp-content/uploads/2021/02/Synapse2.png" alt="" class="wp-image-1053" srcset="https://www.albertnogues.com/wp-content/uploads/2021/02/Synapse2.png 638w, https://www.albertnogues.com/wp-content/uploads/2021/02/Synapse2-300x234.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/02/Synapse2-77x60.png 77w" sizes="auto, (max-width: 638px) 100vw, 638px" /><figcaption>Adding an IP to our synapse instance</figcaption></figure>



<p>2. Load TPC-DS Data</p>



<p>Unfortunately I haven&#8217;t been able to find any TPC-DS data in Azure Blob Storage. The table structure I am going to use is the fivetran table structure for Azure Synapse found in their repo <a rel="noreferrer noopener" href="https://github.com/fivetran/benchmark/blob/master/500-PopulateAzure.sql" data-type="URL" data-id="https://github.com/fivetran/benchmark/blob/master/500-PopulateAzure.sql" target="_blank">here</a>, albeit with a modification stated in <a rel="noreferrer noopener" href="https://www.atscale.com/wp-content/uploads/2020/07/AtScale-Cloud-Data-Warehouse-Benchmark-Azure-Synapse-Analytics-SQL-20200813-1.pdf" data-type="URL" data-id="https://www.atscale.com/wp-content/uploads/2020/07/AtScale-Cloud-Data-Warehouse-Benchmark-Azure-Synapse-Analytics-SQL-20200813-1.pdf" target="_blank">this atscale pdf</a>, as I think it makes more sense. Basically, the dimension tables will be replicated with a clustered columnstore index, and the fact table will use a hash distribution on the wr_item_sk column. They also suggest a columnstore index order on the fact tables, but for our test I don&#8217;t think it&#8217;s necessary.</p>



<p>The table structure, extracted from the fivetran repo, is the same set of 4 tables I used for my previous article on Redshift, which you can find <a href="http://www.albertnogues.com.preview.services/load-data-from-s3-and-run-tpc-ds-queries-on-amazon-redshift/" data-type="URL" data-id="http://www.albertnogues.com.preview.services/load-data-from-s3-and-run-tpc-ds-queries-on-amazon-redshift/">here</a>. Here is the modified list:</p>



<pre class="wp-block-code"><code>create table customer_address (
    ca_address_sk             bigint,
    ca_address_id             nvarchar(16),
    ca_street_number          nvarchar(10),
    ca_street_name            nvarchar(60),
    ca_street_type            nvarchar(15),
    ca_suite_number           nvarchar(10),
    ca_city                   nvarchar(60),
    ca_county                 nvarchar(30),
    ca_state                  nvarchar(2),
    ca_zip                    nvarchar(10),
    ca_country                nvarchar(20),
    ca_gmt_offset             float,
    ca_location_type          nvarchar(20)
)
WITH
( 
  DISTRIBUTION = REPLICATE,
  CLUSTERED COLUMNSTORE INDEX
)
GO

create table customer (
    c_customer_sk             bigint,
    c_customer_id             nvarchar(16),
    c_current_cdemo_sk        bigint,
    c_current_hdemo_sk        bigint,
    c_current_addr_sk         bigint,
    c_first_shipto_date_sk    bigint,
    c_first_sales_date_sk     bigint,
    c_salutation              nvarchar(10),
    c_first_name              nvarchar(20),
    c_last_name               nvarchar(30),
    c_preferred_cust_flag     nvarchar(1),
    c_birth_day               int,
    c_birth_month             int,
    c_birth_year              int,
    c_birth_country           nvarchar(20),
    c_login                   nvarchar(13),
    c_email_address           nvarchar(50),
    c_last_review_date        nvarchar(10)
)
WITH
( 
  DISTRIBUTION = REPLICATE,
  CLUSTERED COLUMNSTORE INDEX
)
GO

create table date_dim (
    d_date_sk                 bigint,
    d_date_id                 nvarchar(16),
    d_date                    nvarchar(10),
    d_month_seq               int,
    d_week_seq                int,
    d_quarter_seq             int,
    d_year                    int,
    d_dow                     int,
    d_moy                     int,
    d_dom                     int,
    d_qoy                     int,
    d_fy_year                 int,
    d_fy_quarter_seq          int,
    d_fy_week_seq             int,
    d_day_name                nvarchar(9),
    d_quarter_name            nvarchar(6),
    d_holiday                 nvarchar(1),
    d_weekend                 nvarchar(1),
    d_following_holiday       nvarchar(1),
    d_first_dom               int,
    d_last_dom                int,
    d_same_day_ly             int,
    d_same_day_lq             int,
    d_current_day             nvarchar(1),
    d_current_week            nvarchar(1),
    d_current_month           nvarchar(1),
    d_current_quarter         nvarchar(1),
    d_current_year            nvarchar(1) 
)
WITH
( 
  DISTRIBUTION = REPLICATE,
  CLUSTERED COLUMNSTORE INDEX
)
GO

create table web_returns (
    wr_returned_date_sk       bigint,
    wr_returned_time_sk       bigint,
    wr_item_sk                bigint,
    wr_refunded_customer_sk   bigint,
    wr_refunded_cdemo_sk      bigint,
    wr_refunded_hdemo_sk      bigint,
    wr_refunded_addr_sk       bigint,
    wr_returning_customer_sk  bigint,
    wr_returning_cdemo_sk     bigint,
    wr_returning_hdemo_sk     bigint,
    wr_returning_addr_sk      bigint,
    wr_web_page_sk            bigint,
    wr_reason_sk              bigint,
    wr_order_number           bigint,
    wr_return_quantity        int,
    wr_return_amt             float,
    wr_return_tax             float,
    wr_return_amt_inc_tax     float,
    wr_fee                    float,
    wr_return_ship_cost       float,
    wr_refunded_cash          float,
    wr_reversed_charge        float,
    wr_account_credit         float,
    wr_net_loss               float
)
WITH
( 
  DISTRIBUTION = HASH(wr_item_sk),
  CLUSTERED COLUMNSTORE INDEX
)
GO</code></pre>



<pre class="wp-block-code"><code>19:33:36Started executing query at Line 2
Commands completed successfully.
19:33:36Started executing query at Line 23
Commands completed successfully.
19:33:36Started executing query at Line 50
Commands completed successfully.
19:33:36Started executing query at Line 87
Commands completed successfully.
Total execution time: 00:00:25.515</code></pre>



<p>So we are good on the table structure. It&#8217;s now time to import the data. We can copy it from an Azure Blob Storage container where fivetran has left the data already generated. I tried their suggested COPY INTO command but it didn&#8217;t work for me because the row terminator was not specified, so if you have problems use the following statements, which worked for me:</p>



<pre class="wp-block-code"><code>copy into date_dim
from 'https://fivetranbenchmark.blob.core.windows.net/tpcds/tpcds_1000_dat/date_dim/'
with (file_type = 'CSV', fieldterminator = '|', ENCODING = 'UTF8', ROWTERMINATOR='0X0A');


copy into customer
from 'https://fivetranbenchmark.blob.core.windows.net/tpcds/tpcds_1000_dat/customer/'
with (file_type = 'CSV', fieldterminator = '|', ENCODING = 'UTF8', ROWTERMINATOR='0X0A');

copy into customer_address
from 'https://fivetranbenchmark.blob.core.windows.net/tpcds/tpcds_1000_dat/customer_address/'
with (file_type = 'CSV', fieldterminator = '|', ENCODING = 'UTF8', ROWTERMINATOR='0X0A');


copy into web_returns
from 'https://fivetranbenchmark.blob.core.windows.net/tpcds/tpcds_1000_dat/web_returns/'
with (file_type = 'CSV', fieldterminator = '|', ENCODING = 'UTF8', ROWTERMINATOR='0X0A');</code></pre>



<p>And after a few minutes (the fact table is a bit more than 10 GB) you will have your data loaded:</p>



<pre class="wp-block-code"><code>20:07:41Started executing query at Line 125
(73049 rows affected)
(12000000 rows affected)
(6000000 rows affected)
(71997522 rows affected)
Total execution time: 00:06:53.032</code></pre>



<p>Note that we are using the 1 TB set for Azure Synapse vs the 3 TB set we used for Redshift, so a comparison between the two wouldn&#8217;t be fair with this dataset. You can see the difference, as our fact table has approximately a third of the rows of the <a rel="noreferrer noopener" href="http://www.albertnogues.com.preview.services/load-data-from-s3-and-run-tpc-ds-queries-on-amazon-redshift/" data-type="URL" data-id="http://www.albertnogues.com.preview.services/load-data-from-s3-and-run-tpc-ds-queries-on-amazon-redshift/" target="_blank">redshift test</a>. Even so, we have a table with almost 72 million rows. If you want to generate a bigger set of data you can do it with the <a rel="noreferrer noopener" href="https://github.com/fivetran/benchmark/blob/master/002-GenerateData.sh" data-type="URL" data-id="https://github.com/fivetran/benchmark/blob/master/002-GenerateData.sh" target="_blank">generate_data.sh</a> script provided by fivetran and load the data into your own Azure Blob Storage container.</p>
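


<p>If you go down that route, a minimal sketch of pushing the generated .dat files to a container from Python could look like this (the azure-storage-blob package, the container name, the local folder and the environment variable name are all assumptions on my side):</p>



<pre class="wp-block-code"><code>import os
from azure.storage.blob import ContainerClient

# Assumptions: the dsdgen output is in ./tpcds_data, a container named "tpcds"
# already exists, and the connection string was exported beforehand as
# AZURE_STORAGE_CONNECTION_STRING (a name of my own choosing)
container = ContainerClient.from_connection_string(
    conn_str=os.environ&#91;"AZURE_STORAGE_CONNECTION_STRING"],
    container_name="tpcds")

for file_name in os.listdir("tpcds_data"):
    with open(os.path.join("tpcds_data", file_name), "rb") as data:
        # e.g. web_returns_1_100.dat ends up under tpcds_1000_dat/
        container.upload_blob(name=f"tpcds_1000_dat/{file_name}", data=data, overwrite=True)</code></pre>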



<p>3. Start running the queries.</p>



<p>As I did for the previous test with Redshift, I will use query30 for this test. The query can be found in the <a rel="noreferrer noopener" href="https://github.com/fivetran/benchmark/blob/master/microsoft_sql/query30.sql" data-type="URL" data-id="https://github.com/fivetran/benchmark/blob/master/microsoft_sql/query30.sql" target="_blank">fivetran github repo</a>, already updated to run on Synapse, but you can adapt the original query as you wish.</p>



<pre class="wp-block-code"><code>WITH customer_total_return 
     AS (SELECT wr_returning_customer_sk AS ctr_customer_sk, 
                ca_state                 AS ctr_state, 
                Sum(wr_return_amt)       AS ctr_total_return 
         FROM   web_returns, 
                date_dim, 
                customer_address 
         WHERE  wr_returned_date_sk = d_date_sk 
                AND d_year = 2000 
                AND wr_returning_addr_sk = ca_address_sk 
         GROUP  BY wr_returning_customer_sk, 
                   ca_state) 
SELECT TOP 100 c_customer_id, 
               c_salutation, 
               c_first_name, 
               c_last_name, 
               c_preferred_cust_flag, 
               c_birth_day, 
               c_birth_month, 
               c_birth_year, 
               c_birth_country, 
               c_login, 
               c_email_address, 
               c_last_review_date, 
               ctr_total_return 
FROM   customer_total_return ctr1, 
       customer_address, 
       customer 
WHERE  ctr1.ctr_total_return &gt; (SELECT Avg(ctr_total_return) * 1.2 
                                FROM   customer_total_return ctr2 
                                WHERE  ctr1.ctr_state = ctr2.ctr_state) 
       AND ca_address_sk = c_current_addr_sk 
       AND ca_state = 'IN' 
       AND ctr1.ctr_customer_sk = c_customer_sk 
ORDER  BY c_customer_id, 
          c_salutation, 
          c_first_name, 
          c_last_name, 
          c_preferred_cust_flag, 
          c_birth_day, 
          c_birth_month, 
          c_birth_year, 
          c_birth_country, 
          c_login, 
          c_email_address, 
          c_last_review_date, 
          ctr_total_return</code></pre>



<p>And the result:</p>



<pre class="wp-block-code"><code>20:31:22Started executing query at Line 145
(100 rows affected)
Total execution time: 00:01:40.416</code></pre>



<p>And a pic with the first 100 rows:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="269" src="https://www.albertnogues.com/wp-content/uploads/2021/02/Synapse3-1024x269-1.png" alt="" class="wp-image-1054" srcset="https://www.albertnogues.com/wp-content/uploads/2021/02/Synapse3-1024x269-1.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/02/Synapse3-1024x269-1-300x79.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/02/Synapse3-1024x269-1-768x202.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/02/Synapse3-1024x269-1-228x60.png 228w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
<p>The post <a href="https://www.albertnogues.com/load-data-from-azure-blob-storage-and-run-tpc-ds-queries-on-azure-synapse/">Load data from azure blob storage and run TPC-DS queries on Azure Synapse.</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Deploying an Azure Storage Account with a queue with terraform and python</title>
		<link>https://www.albertnogues.com/deploying-and-interacting-with-an-azure-storage-account-and-a-queue-with-terraform-and-python-part-i/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=deploying-and-interacting-with-an-azure-storage-account-and-a-queue-with-terraform-and-python-part-i</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Wed, 23 Dec 2020 14:39:14 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Cloud]]></category>
		<category><![CDATA[Queue]]></category>
		<category><![CDATA[Storage Account]]></category>
		<category><![CDATA[Terraform]]></category>
		<guid isPermaLink="false">http://192.168.1.40/?p=939</guid>

<description><![CDATA[<p>In this post we will deploy the Azure infrastructure for a storage account queue. For this we will use Terraform, which can be downloaded from here. Before starting, we need a few prerequisites, these are the following: Installing terraform, azure-cli and azure-storage-queue Create a service principal for deploying the resources with terraform Creating the &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/deploying-and-interacting-with-an-azure-storage-account-and-a-queue-with-terraform-and-python-part-i/">Deploying an Azure Storage Account with a queue with terraform and python</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>In this post we will deploy the Azure infrastructure for a storage account queue. For this we will use Terraform, which can be downloaded from <a rel="noreferrer noopener" href="https://www.terraform.io/downloads.html" target="_blank">here</a>.</p>



<p>Before starting, we need a few prerequisites:</p>



<ol class="wp-block-list"><li>Installing terraform, azure-cli and azure-storage-queue</li><li>Create a service principal for deploying the resources with terraform</li><li>Creating the terraform tf file with all the components required to be deployed. These include:<ul><li>A Resource Group</li><li>A Storage account</li><li>A queue inside the previous storage account</li></ul></li><li>Run the terraform deployment</li><li>Write the python code to send a message and retrieve it from the queue created in the storage account (Part II)</li></ol>



<p>Let&#8217;s start!</p>



<p>Before starting to work with Terraform, we need to install it. We can do that easily by downloading it from the official website and placing the binary somewhere in our PATH. We can download the latest packaged binary from <a rel="noreferrer noopener" href="https://www.terraform.io/downloads.html" target="_blank">here</a>.</p>



<p>To install azure-cli and azure-storage-queue we can use Python + pip. If you don&#8217;t have Python, you can download it from the official website <a rel="noreferrer noopener" href="https://www.python.org/downloads/" target="_blank">here</a>. Make sure you install pip as well as part of the installation process.</p>



<p>Once Python is set up we can run the following to install the packages and their dependencies:</p>



<pre class="wp-block-code"><code>pip install azure-cli azure-queue-storage</code></pre>



<p>After installing the packages we are good to move on to point 2. For the service principal creation we can use the freshly installed azure-cli. The call to create the SPN is as follows:</p>



<pre class="wp-block-code"><code>az ad sp create-for-rbac --name="<strong><span class="has-inline-color has-vivid-red-color">SPForTerraform</span></strong>" --role="Contributor" --scopes="/subscriptions/<strong><span class="has-inline-color has-vivid-red-color">ourSubscriptionId</span></strong>"</code></pre>



<p>Make sure you replace ourSubscriptionId with your Azure subscription id. In case you don&#8217;t know where to obtain it, you can follow this <a rel="noreferrer noopener" href="https://docs.microsoft.com/en-us/azure/media-services/latest/how-to-set-azure-subscription?tabs=portal" data-type="URL" data-id="https://docs.microsoft.com/en-us/azure/media-services/latest/how-to-set-azure-subscription?tabs=portal" target="_blank">procedure</a> or <a href="https://docs.bitnami.com/azure/faq/administration/find-subscription-id/" data-type="URL" data-id="https://docs.bitnami.com/azure/faq/administration/find-subscription-id/" target="_blank" rel="noreferrer noopener">this one</a>. The SPN will be named after the --name parameter and will have the Contributor role. This means the SPN will have access to everything except managing users, so we can create any sort of resource with it. This process may take a few seconds; after a while we should see something like this:</p>



<pre class="wp-block-code"><code>Changing "SPForTerraform" to a valid URI of "http://SPForTerraform", which is the required format used for service principal names
Creating a role assignment under the scope of "/subscriptions/xxxxxx"
  Retrying role assignment creation: 1/36
  Retrying role assignment creation: 2/36
  Retrying role assignment creation: 3/36
The output includes credentials that you must protect. Be sure that you do not include these credentials in your code or check the credentials into your source control. For more information, see https://aka.ms/azadsp-cli
{
  "appId": "xxxxxx",
  "displayName": "SPForTerraform",
  "name": "http://SPForTerraform",
  "password": "yyyyyy",
  "tenant": "zzzzzz"
}</code></pre>



<p>Make sure you protect these, as anybody getting them will be able to access your tenant and start deploying resources in it.</p>



<p>Now we can move to the third point, which is creating the Terraform file. We can open our preferred editor and create a *.tf file with the following content:</p>



<pre class="wp-block-code"><code>provider "azurerm" {
subscription_id = "kkkkkk"
client_id = "xxxxxx"
client_secret = "yyyyyy"
tenant_id = "zzzzzz"
features {}
}

terraform {
  required_version = "&gt;= 0.13"
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
  }
} </code></pre>



<p>The first block initializes the azurerm provider, while the second one initializes the Terraform file with the required azurerm provider and tells Terraform where to obtain it and the minimum version required. client_id is the same as the appId returned by the az cli when the SPN was created.</p>



<p>Then we need another three blocks, one for each resource we want to deploy:</p>



<pre class="wp-block-code"><code>resource "azurerm_resource_group" "rg1" {
name = "TerraformRG"
location = "West Europe"
tags = { Owner = "Albert Nogues" }
}

resource "azurerm_storage_account" "sacc1" {
  name                     = "teststorageacc1"
  resource_group_name      = azurerm_resource_group.rg1.name
  location                 = azurerm_resource_group.rg1.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

resource "azurerm_storage_queue" "queue1" {
  name                 = "queue1"
  storage_account_name = azurerm_storage_account.sacc1.name
}</code></pre>



<p>The way terraform blocks work is the following:</p>



<pre class="wp-block-code"><code>resource "name_of_the_azurerm_resource" "our_alias" {
  ... parameters and vars ...
}</code></pre>



<p>We can then use the alias to reference values from other blocks we defined previously. For example, the storage account takes its location from the location field defined in the resource group; this way we ensure we create the storage account in the same Azure location as the resource group (even though it&#8217;s not mandatory). An example of how to define a storage account can be found in the official documentation <a href="https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_queue" data-type="URL" data-id="https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_queue" target="_blank" rel="noreferrer noopener">here</a>.</p>



<p>The same goes for the queue, which is created in the storage_account_name referenced from the previous block. We can also add tags to the resources, as shown in the resource group block.</p>



<p>Once we are ready, we can launch Terraform to create our resources. For security purposes (and when working in multiple environments) it&#8217;s usually not good to keep the client_id, client_secret, tenant_id and subscription_id in the code. Terraform can read environment variables at execution time. These can be set (Windows) or exported (Linux) with these names:</p>



<pre class="wp-block-code"><code>ARM_SUBSCRIPTION_ID
ARM_CLIENT_ID
ARM_CLIENT_SECRET
ARM_TENANT_ID</code></pre>



<p>Once all is ready we can trigger our creation script. We navigate to where our .tf file is and launch it:</p>



<pre class="wp-block-code"><code>terraform plan</code></pre>



<p>This will give us the changes needed in our Azure subscription to accommodate the resources, and it should show three changes: one for the resource group, another for the storage account and another for the queue:</p>



<pre class="wp-block-code"><code>terraform.exe plan

azurerm_resource_group.rg: Refreshing state... &#91;id=/subscriptions/kkkkkk/resourceGroups/TerraformRG]

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # azurerm_resource_group.rg1 will be created
  + resource "azurerm_resource_group" "rg1" {
      + id       = (known after apply)
      + location = "westeurope"
      + name     = "TerraformRG"
      + tags     = {
          + "Owner" = "Albert Nogues"
        }
    }

  # azurerm_storage_account.sacc1 will be created
  + resource "azurerm_storage_account" "sacc1" {
      + access_tier                      = (known after apply)
      + account_kind                     = "StorageV2"
      + account_replication_type         = "LRS"
      + account_tier                     = "Standard"
      + allow_blob_public_access         = false
      + enable_https_traffic_only        = true
      + id                               = (known after apply)
      + is_hns_enabled                   = false
      + large_file_share_enabled         = (known after apply)
      + location                         = "westeurope"
      + min_tls_version                  = "TLS1_0"
      + name                             = "teststorageacc1"
      + primary_access_key               = (sensitive value)
      + primary_blob_connection_string   = (sensitive value)
      + primary_blob_endpoint            = (known after apply)
      + primary_blob_host                = (known after apply)
      + primary_connection_string        = (sensitive value)
      + primary_dfs_endpoint             = (known after apply)
      + primary_dfs_host                 = (known after apply)
      + primary_file_endpoint            = (known after apply)
      + primary_file_host                = (known after apply)
      + primary_location                 = (known after apply)
      + primary_queue_endpoint           = (known after apply)
      + primary_queue_host               = (known after apply)
      + primary_table_endpoint           = (known after apply)
      + primary_table_host               = (known after apply)
      + primary_web_endpoint             = (known after apply)
      + primary_web_host                 = (known after apply)
      + resource_group_name              = "TerraformRG"
      + secondary_access_key             = (sensitive value)
      + secondary_blob_connection_string = (sensitive value)
      + secondary_blob_endpoint          = (known after apply)
      + secondary_blob_host              = (known after apply)
      + secondary_connection_string      = (sensitive value)
      + secondary_dfs_endpoint           = (known after apply)
      + secondary_dfs_host               = (known after apply)
      + secondary_file_endpoint          = (known after apply)
      + secondary_file_host              = (known after apply)
      + secondary_location               = (known after apply)
      + secondary_queue_endpoint         = (known after apply)
      + secondary_queue_host             = (known after apply)
      + secondary_table_endpoint         = (known after apply)
      + secondary_table_host             = (known after apply)
      + secondary_web_endpoint           = (known after apply)
      + secondary_web_host               = (known after apply)

      + blob_properties {
          + cors_rule {
              + allowed_headers    = (known after apply)
              + allowed_methods    = (known after apply)
              + allowed_origins    = (known after apply)
              + exposed_headers    = (known after apply)
              + max_age_in_seconds = (known after apply)
            }

          + delete_retention_policy {
              + days = (known after apply)
            }
        }

      + identity {
          + principal_id = (known after apply)
          + tenant_id    = (known after apply)
          + type         = (known after apply)
        }

      + network_rules {
          + bypass                     = (known after apply)
          + default_action             = (known after apply)
          + ip_rules                   = (known after apply)
          + virtual_network_subnet_ids = (known after apply)
        }

      + queue_properties {
          + cors_rule {
              + allowed_headers    = (known after apply)
              + allowed_methods    = (known after apply)
              + allowed_origins    = (known after apply)
              + exposed_headers    = (known after apply)
              + max_age_in_seconds = (known after apply)
            }

          + hour_metrics {
              + enabled               = (known after apply)
              + include_apis          = (known after apply)
              + retention_policy_days = (known after apply)
              + version               = (known after apply)
            }

          + logging {
              + delete                = (known after apply)
              + read                  = (known after apply)
              + retention_policy_days = (known after apply)
              + version               = (known after apply)
              + write                 = (known after apply)
            }

          + minute_metrics {
              + enabled               = (known after apply)
              + include_apis          = (known after apply)
              + retention_policy_days = (known after apply)
              + version               = (known after apply)
            }
        }
    }

  # azurerm_storage_queue.queue1 will be created
  + resource "azurerm_storage_queue" "queue1" {
      + id                   = (known after apply)
      + name                 = "queue1"
      + storage_account_name = "teststorageacc1"
    }

<strong><span class="has-inline-color has-vivid-red-color">Plan: 3 to add, 0 to change, 0 to destroy.</span></strong>

------------------------------------------------------------------------

Note: You didn't specify an "-out" parameter to save this plan, so Terraform
can't guarantee that exactly these actions will be performed if
"terraform apply" is subsequently run.
</code></pre>



<p>Once we confirm this is the change we want and all is in order we can run terraform apply:</p>



<pre class="wp-block-code"><code>terraform.exe apply
...
...
...
Plan: 3 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: <strong>yes</strong></code></pre>



<p>We say yes and let&#8217;s go!</p>



<p>It may happen that there is some error, like an already-taken name; in this case we will get an error:</p>



<pre class="wp-block-code"><code>Error: Error creating Azure Storage Account "teststorageacc1": storage.AccountsClient#Create: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=&lt;nil&gt; Code="StorageAccountAlreadyTaken" Message="The storage account named teststorageacc1 is already taken."

  on terraformTest.tf line 48, in resource "azurerm_storage_account" "sacc1":
  48: resource "azurerm_storage_account" "sacc1" {
</code></pre>



<p>We can fix it and rerun, and the resources that were already created will not be modified (in this case, the resource group).</p>



<pre class="wp-block-code"><code>azurerm_storage_account.sacc1: Creating...
azurerm_storage_account.sacc1: Still creating... &#91;10s elapsed]
azurerm_storage_account.sacc1: Still creating... &#91;20s elapsed]
azurerm_storage_account.sacc1: Creation complete after 21s &#91;id=/subscriptions/kkkkkk/resourceGroups/TerraformRG/providers/Microsoft.Storage/storageAccounts/teststorageacc1anogues]
azurerm_storage_queue.queue1: Creating...
azurerm_storage_queue.queue1: Creation complete after 0s &#91;id=https://teststorageacc1anogues.queue.core.windows.net/queue1]

Apply complete! Resources: 2 added, 0 changed, 0 destroyed.</code></pre>



<p>Now we can either call the CLI or browse the Azure portal to confirm that the storage account was created, along with the queue inside it:</p>



<pre class="wp-block-code"><code><strong>az storage account list</strong>
&#91;
  {
    "accessTier": "Hot",
    "allowBlobPublicAccess": false,
    "azureFilesIdentityBasedAuthentication": null,
    "blobRestoreStatus": null,
    "creationTime": "2020-12-23T18:30:49.007107+00:00",
    "customDomain": null,
    "enableHttpsTrafficOnly": true,
    "encryption": {
      "keySource": "Microsoft.Storage",
      "keyVaultProperties": null,
      "requireInfrastructureEncryption": null,
      "services": {
        "blob": {
          "enabled": true,
          "keyType": "Account",
          "lastEnabledTime": "2020-12-23T18:30:49.085263+00:00"
        },
        "file": {
          "enabled": true,
          "keyType": "Account",
          "lastEnabledTime": "2020-12-23T18:30:49.085263+00:00"
        },
        "queue": null,
        "table": null
      }
    },
    "failoverInProgress": null,
    "geoReplicationStats": null,
    "id": "/subscriptions/kkkkkk/resourceGroups/TerraformRG/providers/Microsoft.Storage/storageAccounts/teststorageacc1anogues",
    "identity": null,
    "isHnsEnabled": false,
    "kind": "StorageV2",
    "largeFileSharesState": null,
    "lastGeoFailoverTime": null,
    "location": "westeurope",
    "minimumTlsVersion": "TLS1_0",
    "name": "<strong><span class="has-inline-color has-vivid-red-color">teststorageacc1anogues</span></strong>",
    "networkRuleSet": {
      "bypass": "AzureServices",
      "defaultAction": "Allow",
      "ipRules": &#91;],
      "virtualNetworkRules": &#91;]
    },
    "primaryEndpoints": {
      "blob": "https://teststorageacc1anogues.blob.core.windows.net/",
      "dfs": "https://teststorageacc1anogues.dfs.core.windows.net/",
      "file": "https://teststorageacc1anogues.file.core.windows.net/",
      "internetEndpoints": null,
      "microsoftEndpoints": null,
      "queue": "https://teststorageacc1anogues.queue.core.windows.net/",
      "table": "https://teststorageacc1anogues.table.core.windows.net/",
      "web": "https://teststorageacc1anogues.z6.web.core.windows.net/"
    },
    "primaryLocation": "westeurope",
    "privateEndpointConnections": &#91;],
    "provisioningState": "Succeeded",
    "resourceGroup": "TerraformRG",
    "routingPreference": null,
    "secondaryEndpoints": null,
    "secondaryLocation": null,
    "sku": {
      "name": "Standard_LRS",
      "tier": "Standard"
    },
    "statusOfPrimary": "available",
    "statusOfSecondary": null,
    "tags": {},
    "type": "Microsoft.Storage/storageAccounts"
  }
]
</code></pre>



<p>And having confirmed the storage account has been created, we can check the queue inside it:</p>



<pre class="wp-block-code"><code>az storage queue list --account-name teststorageacc1anogues

There are no credentials provided in your command and environment, we will query for the account key inside your storage account.
Please provide --connection-string, --account-key or --sas-token as credentials, or use `--auth-mode login` if you have required RBAC roles in your command. For more information about RBAC roles in storage, visit https://docs.microsoft.com/en-us/azure/storage/common/storage-auth-aad-rbac-cli.
Setting the corresponding environment variables can avoid inputting credentials in your command. Please use --help to get more information.
&#91;
  {
    "approximateMessageCount": null,
    "metadata": null,
    "name": "<strong><span class="has-inline-color has-vivid-red-color">queue1</span></strong>"
  }
]</code></pre>



<p>And we confirmed all is ready for our second part!</p>



<p>It&#8217;s time to move to the azure-storage-queue Python library for our 5th step. There are different ways to authenticate against the queue: either we use the QueueServiceClient or the QueueService. The differences between the two are described in the official documentation <a rel="noreferrer noopener" href="https://pypi.org/project/azure-storage-queue/" data-type="URL" data-id="https://pypi.org/project/azure-storage-queue/" target="_blank">here</a>.</p>



<p>The easiest way, albeit not the recommended one, is to use the queue key, or even better, to grab the connection string directly from the storage account&#8217;s keys section. Be aware that this allows doing virtually anything with that queue, so don&#8217;t lose this key, and if you leak it, rotate the keys and generate new ones.</p>



<p>Just open a new Python file; to send a message synchronously, use the following code:</p>



<pre class="wp-block-code"><code>from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string(conn_str="DefaultEndpointsProtocol=https;AccountName=teststorageacc1anogues;AccountKey=<strong><span class="has-inline-color has-vivid-red-color">xxxxxx</span></strong>;EndpointSuffix=core.windows.net", queue_name="<strong><span class="has-inline-color has-vivid-cyan-blue-color">queue1</span></strong>")

queue.send_message("Hello World!")</code></pre>



<p>This will send a Hello World message to our queue. If we print the output received we should see some debug info like the following:</p>



<pre class="wp-block-code"><code>{'id': 'a489914f-cc7e-4504-9feb-33b0efce6587', 'inserted_on': datetime.datetime(2020, 12, 24, 19, 21, 20, tzinfo=datetime.timezone.utc), 'expires_on': datetime.datetime(2020, 12, 31, 19, 21, 20, tzinfo=datetime.timezone.utc), 'dequeue_count': None, 'content': 'Hello World!', 'pop_receipt': 'AgAAAAMAAAAAAAAA727vbgja1gE=', 'next_visible_on': datetime.datetime(2020, 12, 24, 19, 21, 20, tzinfo=datetime.timezone.utc)}</code></pre>



<p>Now, if we don&#8217;t believe the message has arrived properly at the queue, we can go to the Azure portal and check for ourselves. For this, we find our storage account, click on Queues and then on queue1 (or whatever name you chose in Terraform), and you should see the message along with the insertion time and the expiry time (by default, 7 days from insertion).</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="833" height="311" src="https://www.albertnogues.com/wp-content/uploads/2020/12/QueueMessage.png" alt="" class="wp-image-962" srcset="https://www.albertnogues.com/wp-content/uploads/2020/12/QueueMessage.png 833w, https://www.albertnogues.com/wp-content/uploads/2020/12/QueueMessage-300x112.png 300w, https://www.albertnogues.com/wp-content/uploads/2020/12/QueueMessage-768x287.png 768w, https://www.albertnogues.com/wp-content/uploads/2020/12/QueueMessage-161x60.png 161w" sizes="auto, (max-width: 833px) 100vw, 833px" /></figure>



<p>Let&#8217;s see how to retrieve that message from Python now. Remember that storage account queues do not delete processed messages: if you do not delete them they will stay until expiration (although received messages stay &#8220;hidden&#8221; for 30 seconds by default, unless you pass another value for the visibility timeout), so make sure your code deletes each message once it has been processed (in this case, read):</p>



<pre class="wp-block-code"><code>messages = queue.receive_messages()

for msg in messages:
    print(msg.content)
    queue.delete_message(msg)</code></pre>



<p>And this will output our message:</p>



<pre class="wp-block-code"><code>Hello World!</code></pre>



<p><em>If you want to only look at your message without actually &#8220;reading&#8221; it, you can use the <strong>peek </strong>function. This will not mark your message as &#8220;seen&#8221;, and it will still be returned to any other consumer reading the queue, or to any retry you do.</em></p>
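


<p>A quick sketch of what that looks like, reusing the queue client from before:</p>



<pre class="wp-block-code"><code># Peek at up to 5 messages without dequeuing them;
# they stay visible to any other consumer of the queue
peeked = queue.peek_messages(max_messages=5)
for msg in peeked:
    print(msg.content)</code></pre>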



<p>If we go back to the Azure portal and check the queue, we should see the message is not there anymore. We can also delete messages from the Azure portal, as well as create them directly there.</p>



<p>To improve our code a little bit we can use SAS tokens instead of the storage account master keys. You can generate a SAS token directly from Python with the following code:</p>



<pre class="wp-block-code"><code>from datetime import datetime, timedelta
from azure.storage.queue import QueueServiceClient, generate_account_sas, ResourceTypes, AccountSasPermissions

sas_token = generate_account_sas(
    account_name="&lt;storage-account-name&gt;",
    account_key="&lt;account-access-key&gt;",
    resource_types=ResourceTypes(service=True),
    permission=AccountSasPermissions(read=True),
    expiry=datetime.utcnow() + timedelta(hours=1)
)

queue_service_client = QueueServiceClient(account_url="https://&lt;my_account_name&gt;.queue.core.windows.net", credential=sas_token)</code></pre>



<p>But you still need to have your credentials inside the code. So the safest way to manage it is to create the SAS token somewhere else, with an expiry not far in the future, and use that in your code, or to use environment variables for passing sensitive information to your code. With the os library in Python you can read them and use them inside your code, without having them written there and exposed in your code or git.</p>
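


<p>For example, a minimal sketch of the environment variable approach (the AZURE_QUEUE_CONN_STR name is my own choice, not an SDK convention):</p>



<pre class="wp-block-code"><code>import os
from azure.storage.queue import QueueClient

# Export the connection string beforehand, e.g.:
#   export AZURE_QUEUE_CONN_STR="DefaultEndpointsProtocol=https;AccountName=..."
queue = QueueClient.from_connection_string(
    conn_str=os.environ&#91;"AZURE_QUEUE_CONN_STR"],
    queue_name="queue1")
queue.send_message("Hello World!")</code></pre>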



<p>For more information you can check Microsoft&#8217;s official documentation <a rel="noreferrer noopener" href="https://docs.microsoft.com/en-us/azure/storage/queues/storage-python-how-to-use-queue-storage?tabs=python" data-type="URL" data-id="https://docs.microsoft.com/en-us/azure/storage/queues/storage-python-how-to-use-queue-storage?tabs=python" target="_blank">here</a>, the quickstart guide <a href="https://docs.microsoft.com/en-us/azure/storage/queues/storage-quickstart-queues-python" data-type="URL" data-id="https://docs.microsoft.com/en-us/azure/storage/queues/storage-quickstart-queues-python" target="_blank" rel="noreferrer noopener">here </a>or the PyPI page <a href="https://pypi.org/project/azure-storage-queue/" data-type="URL" data-id="https://pypi.org/project/azure-storage-queue/" target="_blank" rel="noreferrer noopener">here</a>.</p>
<p>The post <a href="https://www.albertnogues.com/deploying-and-interacting-with-an-azure-storage-account-and-a-queue-with-terraform-and-python-part-i/">Deploying an Azure Storage Account with a queue with terraform and python</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
