ELT with Azure Databricks
Hands-on lab step-by-step
Feb 2020
Information in this document, including URL and other Internet Web site references, is subject to change without notice. Unless otherwise noted, the example companies, organizations, products, domain names, e-mail addresses, logos, people, places, and events depicted herein are fictitious, and no association with any real company, organization, product, domain name, e-mail address, logo, person, place or event is intended or should be inferred. Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.
The names of manufacturers, products, or URLs are provided for informational purposes only and Microsoft makes no representations and warranties, either expressed, implied, or statutory, regarding these manufacturers or the use of the products with any Microsoft technologies. The inclusion of a manufacturer or product does not imply endorsement of Microsoft of the manufacturer or product. Links may be provided to third party sites. Such sites are not under the control of Microsoft and Microsoft is not responsible for the contents of any linked site or any link contained in a linked site, or any changes or updates to such sites. Microsoft is not responsible for webcasting or any other form of transmission received from any linked site. Microsoft is providing these links to you only as a convenience, and the inclusion of any link does not imply endorsement of Microsoft of the site or the products contained therein.
© 2018 Microsoft Corporation. All rights reserved.
Microsoft and the trademarks listed at https://www.microsoft.com/en-us/legal/intellectualproperty/Trademarks/Usage/General.aspx are trademarks of the Microsoft group of companies. All other trademarks are property of their respective owners.
Azure Databricks Hands-on Lab
Abstract and learning objectives
Task 1: Install Azure Storage Explorer
Task 2: Clone the GitHub repository
Task 3: Deploy Azure Resource Group
Task 4: Deploy an Azure Storage Account
Task 5: Deploy Azure Databricks
Exercise 1: Azure Databricks Fundamentals
Task 1: Open Azure Databricks Workspace
Task 2: Import the Databricks Notebooks
Exercise 2: Follow the notebooks in order
Abstract and learning objectives
In this workshop, you will deploy an Azure Databricks workspace, access data stored in Azure Storage as CSV or Parquet files, and transform that data using PySpark and Spark SQL.
Note: This lab is designed to complement the classroom presentation. The instructor will tell attendees which exercises to attempt and in what order.
Note on naming conventions: If you are using a shared environment (subscription), follow this naming convention so that your resources are easy to tell apart from those of other participants: hol-<your-name>-<resource-name>
You will provision the following resources:
- Resource group (if your environment has not already provided one): hol-andrew-rg
- Azure Storage account: holandrewstorage (storage account names may contain only lowercase letters and numbers, so no hyphens)
- Azure Databricks workspace: hol-andrew-databricks
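The naming convention above can be sketched as a small helper. This is only an illustration: the function name lab_names is hypothetical, and the check encodes the Azure rule that storage account names are 3 to 24 characters of lowercase letters and digits.

```python
import re

# Azure storage account names: 3-24 characters, lowercase letters and digits only.
_STORAGE_NAME = re.compile(r"[a-z0-9]{3,24}")


def lab_names(participant: str) -> dict:
    """Build the three lab resource names for one participant (hypothetical helper)."""
    # Keep only lowercase letters and digits so the same slug is valid everywhere.
    slug = re.sub(r"[^a-z0-9]", "", participant.lower())
    if not slug:
        raise ValueError("participant name must contain a letter or digit")
    storage = f"hol{slug}storage"
    if not _STORAGE_NAME.fullmatch(storage):
        raise ValueError(f"{storage!r} is not a valid storage account name")
    return {
        "resource_group": f"hol-{slug}-rg",
        "storage_account": storage,
        "databricks_workspace": f"hol-{slug}-databricks",
    }


print(lab_names("Andrew"))
```

Because "hol" and "storage" already take up 10 characters, a participant name longer than 14 characters would need to be shortened by hand.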
Task 1: Install Azure Storage Explorer
Microsoft Azure Storage Explorer is a standalone app that makes it easy to work with Azure Storage data on Windows, macOS, and Linux (see the Azure documentation on Storage Explorer).
- Download Azure Storage Explorer from the Azure Storage Explorer download page.
- Install Azure Storage Explorer.
- Click Open Connect Dialog.
- Select “Add an Azure account” and continue by logging in to your Azure account.
- Once you have logged in, you can see a list of your storage accounts.
Task 2: Clone the GitHub repository
Clone the lab's GitHub repository to your machine, or download the repository's source as a ZIP file.
Task 3: Deploy Azure Resource Group
- Search for Resource groups in the main search bar of the Azure portal
- Click Add
- For Resource group name, enter hol-<your-name>-rg, for example hol-john-rg
Task 4: Deploy an Azure Storage Account
- Search for Storage Accounts
- Click Add
- Provide the subscription and resource group
- For Storage account name, enter hol<yourname>storage, for example holjohnstorage
- Select a Location
- For Account kind, select StorageV2
- For Replication, change to LRS
- Click “Review + Create” and then “Create”
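For readers who prefer scripting to the portal, the same choices (StorageV2 kind, LRS replication) can be expressed with the Azure SDK for Python. This is a sketch under assumptions, not part of the lab: it assumes the azure-identity and azure-mgmt-storage packages are installed, that you are already signed in (for example through the Azure CLI), and that the Standard performance tier is intended, which gives the SKU name Standard_LRS. The function names are hypothetical.

```python
def storage_account_parameters(location: str) -> dict:
    """Request body mirroring the portal choices: StorageV2 kind, LRS replication."""
    return {
        "location": location,
        "kind": "StorageV2",
        # Assumption: Standard performance tier, so LRS becomes "Standard_LRS".
        "sku": {"name": "Standard_LRS"},
    }


def create_storage_account(subscription_id, resource_group, account_name, location):
    """Create the account with the Azure SDK and wait for provisioning to finish."""
    # Imported lazily so the parameter helper above works without the SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.storage import StorageManagementClient

    client = StorageManagementClient(DefaultAzureCredential(), subscription_id)
    poller = client.storage_accounts.begin_create(
        resource_group, account_name, storage_account_parameters(location)
    )
    return poller.result()  # blocks until the long-running operation completes
```

A call might look like create_storage_account("<subscription-id>", "hol-john-rg", "holjohnstorage", "westeurope"), where the region is only an example value.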
Task 5: Deploy Azure Databricks
- Search for “Databricks” in the Azure portal
- Click Add
- On the creation screen:
  - Select your subscription
  - Select your Resource Group
  - Provide the name hol-<yourname>-databricks
  - Select a Location
  - For Pricing Tier, select Trial
- Click Review + Create
- Click Create
Exercise 1: Azure Databricks Fundamentals
Task 1: Open Azure Databricks Workspace
- Go to the Azure Databricks resource in the Azure portal
- Click “Launch Workspace” to open the Databricks Workspace
Task 2: Import the Databricks Notebooks
- In the left-hand bar, select “Workspace”
- Click the downward arrow next to Workspace, and then click Import
- Select “File” and browse to the DBC (Databricks archive) file to upload
- In the folder where you cloned or downloaded the GitHub repo, under “Databricks Notebooks”, select “Azure_Databricks_Workshop”
- Click “Import”
- Once the import is done, you should see an Azure Databricks Workshop directory containing two sub-directories.
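The import steps above can also be performed through the Databricks Workspace REST API, which accepts a base64-encoded archive at /api/2.0/workspace/import. The sketch below is an optional alternative, not the lab's method. It assumes you have generated a personal access token in the workspace. The function names, the workspace URL, and the target path are hypothetical.

```python
import base64
import json
import urllib.request


def build_import_payload(dbc_bytes: bytes, target_path: str) -> dict:
    """JSON body for the workspace import endpoint, with the archive base64-encoded."""
    return {
        "path": target_path,
        "format": "DBC",
        "content": base64.b64encode(dbc_bytes).decode("ascii"),
    }


def import_archive(workspace_url: str, token: str, dbc_file: str, target_path: str) -> None:
    """Upload a local DBC archive into the workspace at target_path."""
    with open(dbc_file, "rb") as fh:
        payload = build_import_payload(fh.read(), target_path)
    request = urllib.request.Request(
        f"{workspace_url.rstrip('/')}/api/2.0/workspace/import",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError if the workspace rejects the request.
    with urllib.request.urlopen(request) as response:
        response.read()
```

For a one-off lab, the portal steps are simpler. A script like this is mainly useful when the same archive has to be loaded into several workspaces.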
Exercise 2: Follow the notebooks in order
Start with Notebook 01 under “01-Databricks fundamentals” and work through the notebooks in order.
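As orientation before opening the notebooks, the pattern named in the abstract (read a file from Azure Storage, then transform it with PySpark and Spark SQL) might look roughly like the sketch below. It is not taken from the notebooks. It assumes the cluster has already been granted access to the storage account, and the function names, the view name lab_data, and the column argument are hypothetical.

```python
def blob_url(account: str, container: str, path: str) -> str:
    """Build a wasbs:// URL for a blob path in an Azure Storage account."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def summarize_csv(spark, url: str, group_column: str):
    """Read a CSV file, then count rows per group with PySpark and with Spark SQL."""
    # header=True uses the first row as column names; inferSchema=True guesses types.
    df = spark.read.csv(url, header=True, inferSchema=True)

    # DataFrame API version of the aggregation (result column is named "count").
    by_api = df.groupBy(group_column).count()

    # Equivalent aggregation in Spark SQL over a temporary view.
    df.createOrReplaceTempView("lab_data")
    by_sql = spark.sql(
        f"SELECT `{group_column}`, COUNT(*) AS row_count "
        f"FROM lab_data GROUP BY `{group_column}`"
    )
    return by_api, by_sql
```

In a Databricks notebook the spark session already exists, so a call could be summarize_csv(spark, blob_url("holandrewstorage", "data", "trips.csv"), "vendor"), where the container, file, and column names are made up for illustration. For Parquet files, spark.read.parquet(url) replaces the CSV reader and needs no header or schema options.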