# Databricks notebook source
# MAGIC %md-sandbox
# MAGIC <div style="text-align: center; line-height: 0; padding-top: 9px;">
# MAGIC   <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
# MAGIC </div>
# COMMAND ----------
# MAGIC %md
# MAGIC # Getting Started
# COMMAND ----------
# MAGIC %md
# MAGIC ## Configuration
# COMMAND ----------
# MAGIC %run ./includes/utilities
# COMMAND ----------
# MAGIC %md
# MAGIC ## Make Notebook Idempotent
# MAGIC
# MAGIC This step makes the notebook
# MAGIC [idempotent](https://stackoverflow.com/a/1077421/1081801). In other
# MAGIC words, the notebook can be run more than once without throwing errors
# MAGIC or introducing extra files.
# COMMAND ----------
# Remove any files written by previous runs so the notebook starts from a clean state
dbutils.fs.rm(projectPath, recurse=True)
# COMMAND ----------
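# For intuition, the same idempotent-cleanup pattern in plain Python. This is
# an analogy for illustration only; the notebook itself relies on
# `dbutils.fs.rm` above.
import os
import shutil

def reset_dir(path: str) -> None:
    # Remove the directory tree if it exists, then recreate it empty.
    # Safe to call any number of times with the same result.
    shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path, exist_ok=True)

# COMMAND ----------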
# MAGIC %md
# MAGIC ## Retrieve and Load the Data
# MAGIC
# MAGIC We will be working with two files:
# MAGIC
# MAGIC - `health_profile_data.snappy.parquet`
# MAGIC - `user_profile_data.snappy.parquet`
# MAGIC
# MAGIC These files can be retrieved and loaded using the utility function `process_file`.
# MAGIC
# MAGIC This function takes three arguments:
# MAGIC
# MAGIC - `file_name: str`
# MAGIC - the name of the file to retrieve
# MAGIC - `path: str`
# MAGIC - the location to write the file as a Delta table
# MAGIC - `table_name: str`
# MAGIC - the name of a table to be used in the Metastore to reference the data
# MAGIC
# MAGIC This function does three things:
# MAGIC
# MAGIC 1. Retrieves a file and loads it into your Databricks Workspace.
# MAGIC 1. Creates a Delta table using the file.
# MAGIC 1. Registers the Delta table in the Metastore so that it can be
# MAGIC    referenced using SQL or a PySpark `table` reference.
# COMMAND ----------
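# A minimal sketch of what `process_file` might do internally. This is a
# hypothetical illustration only: the real implementation lives in
# ./includes/utilities, and `sourceDir` is an assumed variable, not defined
# by this notebook.
def process_file_sketch(file_name: str, path: str, table_name: str) -> None:
    # 1. Read the retrieved file into a Spark DataFrame
    df = spark.read.parquet(f"{sourceDir}/{file_name}")
    # 2. Write the DataFrame out as a Delta table at `path`
    df.write.format("delta").mode("overwrite").save(path)
    # 3. Register the Delta table in the Metastore under `table_name`
    spark.sql(
        f"CREATE TABLE IF NOT EXISTS {table_name} USING DELTA LOCATION '{path}'"
    )

# COMMAND ----------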
# MAGIC %md
# MAGIC ### Retrieve and Load the Data
# MAGIC
# MAGIC Retrieve the data using the following arguments:
# MAGIC
# MAGIC | `file_name` | `path` | `table_name` |
# MAGIC |:-:|:-:|:-|
# MAGIC | `health_profile_data.snappy.parquet` | `silverDailyPath` | `health_profile_data` |
# MAGIC | `user_profile_data.snappy.parquet` | `dimUserPath` | `user_profile_data` |
# COMMAND ----------
# TODO
# Use the utility function `process_file` to retrieve the data.
# Use the arguments in the table above.
process_file(
    FILL_IN_FILE_NAME,
    FILL_IN_PATH,
    FILL_IN_TABLE_NAME
)
process_file(
    FILL_IN_FILE_NAME,
    FILL_IN_PATH,
    FILL_IN_TABLE_NAME
)
# COMMAND ----------
# MAGIC %md
# MAGIC ## Data Availability
# MAGIC
# MAGIC In a typical workflow, data will have been made available to you
# MAGIC by a data engineer as tables that can be queried using SQL or
# MAGIC PySpark. Here, the `process_file` function has performed the steps
# MAGIC necessary to make the files available in your workspace so that
# MAGIC you can focus on data science.
# COMMAND ----------
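# With the tables registered in the Metastore, they can be queried directly.
# A minimal sketch, assuming the `health_profile_data` table was registered
# by `process_file` above:
health_df = spark.table("health_profile_data")
health_df.printSchema()
spark.sql("SELECT COUNT(*) AS row_count FROM health_profile_data").show()

# COMMAND ----------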
# MAGIC %md-sandbox
# MAGIC © 2020 Databricks, Inc. All rights reserved.<br/>
# MAGIC Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
# MAGIC <br/>
# MAGIC <a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>