19 commits
26f2dbb
initial commit for datascan blog
evasharma Sep 21, 2023
4fa77db
adding markup file to describe scan details
evasharma Oct 16, 2023
182efea
Update blog_1_data_profile_configuration_details.md
evasharma Oct 16, 2023
e7a8bf1
add images folder to datascan to keep blog examples related supportin…
evasharma Oct 16, 2023
df0d5f8
Merge branch 'datascan-blog-work-evash' of https://github.com/GoogleC…
evasharma Oct 16, 2023
fcdb8e5
Update blog_1_data_profile_configuration_details.md
evasharma Oct 16, 2023
1f375e2
Update blog_1_data_profile_configuration_details.md
evasharma Oct 16, 2023
a1340d7
fixing images spelling
evasharma Oct 16, 2023
7b6969d
Update blog_1_data_profile_configuration_details.md
evasharma Oct 16, 2023
3c5dbf4
Update blog_1_data_profile_configuration_details.md
evasharma Oct 16, 2023
772a7eb
Update blog_1_data_profile_configuration_details.md
evasharma Oct 16, 2023
6a5c4d2
add more images related to profile example
evasharma Oct 16, 2023
01be26b
fix the name of an image file
evasharma Oct 16, 2023
9ce96b9
fix the location of column filter image
evasharma Oct 16, 2023
11d5157
Update blog_1_data_profile_configuration_details.md
evasharma Oct 16, 2023
fa04b71
Update blog_1_data_profile_configuration_details.md
evasharma Oct 16, 2023
9248195
adding more images for blog
evasharma Oct 16, 2023
3af8f26
Merge branch 'datascan-blog-work-evash' of https://github.com/GoogleC…
evasharma Oct 16, 2023
a927c2f
Update blog_1_data_profile_configuration_details.md
evasharma Oct 16, 2023
nit: since we are already in the datascan folder, we could shorten "datascan-blog-examples" to "blog-examples".

@@ -0,0 +1,68 @@
##### [How to Use Dataplex Data Profile to Unleash the Power of Your Enterprise Data](link)

Detailed instructions for setting up the data profile scan for the example scenario in the blog.

###### _Display Name_
Provide a descriptive name for your profile scan. Keep it as unique as possible, since it is used to auto-generate a unique ID for the scan. Let’s choose `inventory items scan`.

![Define Scan](./images/profile_scan_display_name.png "Define Scan")

###### _Table to scan_
Here you can directly specify the path to your BigQuery table, since your data is structured and already organized into a BigQuery table.

![Table to scan](./images/profile_scan_data_source.png "Table to scan")

###### _Scope_
Here you can specify the scope of your scan. It can be either `Entire data`, where the profile scan runs on the whole table every time, or `Incremental`, where each scan starts from the end of the previous scan.

Setting the scope to `Entire data` is useful if you a) don’t intend to receive more data into the source table, or b) only want to run a one-off scan to get an initial summary of the data.

For `Incremental` scans, since the scan needs to keep track of the last scanned row, you need to specify an unnested column of `Timestamp` or `Date` type from your source table. This should be a `Required` column whose values **monotonically** increase over time.

Since, in this case, you expect the source table to be updated with ~200 rows every day and you are interested in tracking the insights from the table on a recurring basis, it makes sense to set the scope to `Incremental` and select the `ingestion_timestamp` column as the `Timestamp` column.

![Scope](./images/profile_scan_scope.png "Scope")
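To make the difference between the two scopes concrete, here is a minimal conceptual sketch in Python. It is not how Dataplex implements incremental scans; it only illustrates the idea of remembering the highest `ingestion_timestamp` seen so far and picking up rows after it. The function name and the sample rows are hypothetical.

```python
from datetime import datetime, timezone


def rows_for_incremental_scan(rows, last_scanned):
    """Return the rows newer than the watermark, plus the updated watermark.

    Conceptual model only: `last_scanned` is None for the very first scan,
    which therefore behaves like an `Entire data` scan.
    """
    new_rows = [
        row for row in rows
        if last_scanned is None or row["ingestion_timestamp"] > last_scanned
    ]
    if not new_rows:
        return [], last_scanned
    new_watermark = max(row["ingestion_timestamp"] for row in new_rows)
    return new_rows, new_watermark


# Hypothetical sample rows; only the column name comes from the example.
rows = [
    {"id": 1, "ingestion_timestamp": datetime(2023, 10, 15, 8, tzinfo=timezone.utc)},
    {"id": 2, "ingestion_timestamp": datetime(2023, 10, 16, 8, tzinfo=timezone.utc)},
]

first_batch, watermark = rows_for_incremental_scan(rows, None)

# A new row arrives the next day; only that row is picked up.
rows.append({"id": 3, "ingestion_timestamp": datetime(2023, 10, 17, 8, tzinfo=timezone.utc)})
second_batch, watermark = rows_for_incremental_scan(rows, watermark)
```

This also shows why the column must increase monotonically: a late row with an older timestamp would fall below the watermark and never be scanned.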

###### _Filter rows_
You can specify a filter in the form of a SQL expression to filter the rows based on your condition. This SQL expression should be a valid expression for a WHERE clause in BigQuery standard SQL syntax. These filters will be applied every time this scan runs on the source table.

Let’s say the product team’s requirement is to only consider data for `distribution center id` greater than 1. The row filter condition can then be `product_distribution_center_id > 1`.

You could also leverage row filters to filter out older data in your tables. This can be particularly useful if you have large tables with legacy data that is not particularly interesting from a monitoring perspective. Recall that the product team also wanted to ignore all the inventory data created before 2019. So an additional row filter condition can be `created_at > TIMESTAMP('2018-12-31')`.

The final row filter condition will be `product_distribution_center_id > 1 and created_at > TIMESTAMP('2018-12-31')`.

![Filter rows](./images/profile_scan_row_filter.png "Filter rows")
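If you maintain several conditions separately, a small helper can assemble them into one filter string. The sketch below is illustrative only: `combine_row_filters` is a hypothetical helper, and the table name in the preview query is a placeholder, not the table from the blog. Running the preview query in BigQuery before saving the scan is one way to check that the expression is valid in a WHERE clause.

```python
def combine_row_filters(*conditions):
    """Join non-empty SQL conditions with AND, parenthesizing each one."""
    cleaned = [c.strip() for c in conditions if c and c.strip()]
    return " AND ".join(f"({c})" for c in cleaned)


row_filter = combine_row_filters(
    "product_distribution_center_id > 1",
    "created_at > TIMESTAMP('2018-12-31')",
)

# Placeholder table name (assumption); substitute your own source table.
SOURCE_TABLE = "my-project.my_dataset.inventory_items"

# A query you could run manually to preview how many rows pass the filter.
preview_query = (
    f"SELECT COUNT(*) AS matching_rows FROM `{SOURCE_TABLE}` WHERE {row_filter}"
)
print(preview_query)
```

Parenthesizing each condition keeps the combined expression unambiguous if one of the conditions later contains an `OR`.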

###### _Filter Columns_
Additionally, you can restrict which columns this profile scan covers. This is particularly useful if you have prior knowledge of which columns are worth scanning.

For instance, in your case, you know that the `ingestion_timestamp` column is a required `Timestamp` column and will not provide any useful profiling information. You can filter out this column by specifying it in the excluded column list. Alternatively, you could specify the columns that you want to be included in the profile scan in the included column list.

Here, we will exclude the `ingestion_timestamp` column, since you already know its values and already use this column to scope the incremental scan.

![Filter columns](./images/profile_scan_column_filter.png "Filter columns")
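The include and exclude lists can be thought of as two set operations over the table's columns. The following sketch is a conceptual model, not Dataplex code; `columns_to_profile` is a hypothetical helper, and the column list contains only the three columns named in this walkthrough.

```python
def columns_to_profile(all_columns, include=None, exclude=None):
    """Apply an optional include list, then drop anything in the exclude list.

    An empty or missing include list means "all columns".
    """
    selected = [c for c in all_columns if not include or c in include]
    excluded = set(exclude or [])
    return [c for c in selected if c not in excluded]


# Columns mentioned in this example; a real table would have more.
table_columns = [
    "product_distribution_center_id",
    "created_at",
    "ingestion_timestamp",
]

profiled = columns_to_profile(table_columns, exclude=["ingestion_timestamp"])
print(profiled)
```

For a wide table where only a handful of columns matter, passing an `include` list is usually shorter than enumerating everything to exclude.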

###### _Sampling size_
Another way to filter the data to be scanned is to specify a sampling size. If specified, the profile scan result will be based on the sampled data. Sampling is applied after the above two filters are applied.

Sampling is particularly useful if you expect each scan to process a large amount of data. Specifying a smaller sampling size for such data reduces cost. Choosing a sampling size appropriate for the amount of data processed per scan yields more accurate profile insights.

Since you only expect ~200 rows to be scanned every day, you can skip configuring the sampling size for this scan.

![Sampling Size](./images/profile_scan_sampling_size.png "Sampling Size")
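The order of operations matters here: rows are filtered first, and the sample is drawn from what remains. The sketch below models that order in plain Python. It assumes a simple random sample expressed as a percentage, which may differ from how Dataplex samples internally; the helper name and the generated rows are hypothetical, and column filtering is left out for brevity.

```python
import random


def sample_after_filters(rows, row_predicate, sampling_percent=None, seed=0):
    """Filter rows first, then draw a random sample of the filtered rows."""
    filtered = [row for row in rows if row_predicate(row)]
    if sampling_percent is None or sampling_percent >= 100 or not filtered:
        return filtered
    sample_size = max(1, round(len(filtered) * sampling_percent / 100))
    # A seeded generator keeps this illustration reproducible.
    return random.Random(seed).sample(filtered, sample_size)


# 200 hypothetical rows spread evenly over distribution centers 1 to 4.
rows = [{"product_distribution_center_id": i % 4 + 1} for i in range(200)]

sampled = sample_after_filters(
    rows,
    row_predicate=lambda row: row["product_distribution_center_id"] > 1,
    sampling_percent=10,
)
print(len(sampled))
```

Because the filter removes a quarter of the rows before sampling, the 10% sample is taken from 150 rows rather than from all 200.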

###### _Schedule_
You can either create an `On-demand` scan, which only runs when you explicitly run it, or specify a `Schedule` to run this scan regularly at a particular time.

Creating a `Repeat` scheduled scan allows you to automatically trigger a scan around a specific event such as data ingestion time. Since you expect your data to be ingested daily at 8 AM PDT, you can schedule the profile scan to run every day at 5 PM PDT. This enables you to gather insights from the data daily.

![Schedule](./images/profile_scan_schedule.png "Schedule")
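If you later create the same scan outside the console, the daily 5 PM schedule is typically written as a cron expression. The sketch below builds one in Python. Two things are assumptions to verify against the Dataplex documentation: that the schedule accepts a `CRON_TZ=<IANA time zone>` prefix, and that `America/Los_Angeles` is the right zone for the "PDT" times used in this example. `daily_cron` is a hypothetical helper.

```python
def daily_cron(hour, minute=0, time_zone=None):
    """Build a five-field cron expression for a once-a-day schedule."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    expression = f"{minute} {hour} * * *"
    # Assumption: the scheduler honors a CRON_TZ prefix for time zones.
    if time_zone:
        expression = f"CRON_TZ={time_zone} {expression}"
    return expression


# Every day at 5 PM Pacific time, after the 8 AM ingestion has completed.
schedule = daily_cron(17, time_zone="America/Los_Angeles")
print(schedule)
```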

###### _Export scan results to BigQuery table_

In `Additional Settings`, you can specify the path to a BigQuery table to which the profile results of each scan are exported for further analysis. This is particularly useful for building more advanced dashboards using Looker Studio or building upstream detection or forecasting systems leveraging BQML models on profile scan results.

For this example, let's assume you want to store results in the `datascan-test-project1.datascan_blog_examples.datascan_inventory_profile_results` table, but this table doesn't exist yet. So, you can specify the path to the dataset `datascan-test-project1.datascan_blog_examples` and give the table name as `datascan_inventory_profile_results`. If this table doesn't exist, it will be created.

![Export scan results to BigQuery Table](./images/profile_scan_export_results.png "Export scan results to BigQuery Table")
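The console asks for the dataset path and the table name separately, so it can help to split the fully qualified table name before filling in the form. The helper below is a small illustrative sketch; `split_table_path` is a hypothetical name, and it assumes the usual `project.dataset.table` form with no dots inside any part.

```python
def split_table_path(table_path):
    """Split `project.dataset.table` into a dataset path and a table name."""
    parts = table_path.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected project.dataset.table, got {table_path!r}")
    project, dataset, table = parts
    return f"{project}.{dataset}", table


dataset_path, table_name = split_table_path(
    "datascan-test-project1.datascan_blog_examples.datascan_inventory_profile_results"
)
print(dataset_path)
print(table_name)
```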
1 change: 1 addition & 0 deletions datascan/datascan-blog-examples/data-setup/README.md
@@ -0,0 +1 @@
This directory contains scripts and setup for datascan blog examples.