Adding profile scan config set up based examples #4
Open — evasharma wants to merge 19 commits into `main` from `datascan-blog-work-evash`
Commits (19, all by evasharma):

- 26f2dbb initial commit for datascan blog
- 4fa77db adding markup file to describe scan details
- 182efea Update blog_1_data_profile_configuration_details.md
- e7a8bf1 add images folder to datascan to keep blog examples related supportin…
- df0d5f8 Merge branch 'datascan-blog-work-evash' of https://github.com/GoogleC…
- fcdb8e5 Update blog_1_data_profile_configuration_details.md
- 1f375e2 Update blog_1_data_profile_configuration_details.md
- a1340d7 fixing images spelling
- 7b6969d Update blog_1_data_profile_configuration_details.md
- 3c5dbf4 Update blog_1_data_profile_configuration_details.md
- 772a7eb Update blog_1_data_profile_configuration_details.md
- 6a5c4d2 add more images related to profile example
- 01be26b fix the name of an image file
- 9ce96b9 fix the location of column filter image
- 11d5157 Update blog_1_data_profile_configuration_details.md
- fa04b71 Update blog_1_data_profile_configuration_details.md
- 9248195 adding more images for blog
- 3af8f26 Merge branch 'datascan-blog-work-evash' of https://github.com/GoogleC…
- a927c2f Update blog_1_data_profile_configuration_details.md
datascan/datascan-blog-examples/blog_1_data_profile_configuration_details.md (68 additions, 0 deletions)
##### [How to Use Dataplex Data Profile to Unleash the Power of Your Enterprise Data](link)

Detailed instructions for setting up the data profile scan for the example scenario in the blog.

###### _Display Name_
Provide a descriptive name for your profile scan. Keep it as unique as possible, since it is used to auto-generate a unique ID for your profile scan. Let's choose `inventory items scan`.


###### _Table to scan_
Here you can directly specify the path to your BigQuery table, since your data is structured and already organized into a BigQuery table.


###### _Scope_
Here you can specify the scope of your scan. It can either be `Entire data`, where the profile scan runs on the whole table every time, or `Incremental`, where each scan starts from the end of the previous scan.

Setting the scope to `Entire data` is useful if you a) don't intend to receive more data into the source table, or b) only want to run a one-off scan to get an initial summary of the data.

For `Incremental` scans, since the scan needs to keep a history of the last scanned row, you need to specify an unnested column of `Timestamp` or `Date` type from your source table. This should be a `Required` column whose values **monotonically** increase over time.

Since, in this case, you expect the source table to be updated with ~200 rows every day and you are interested in tracking insights from the table on a recurring basis, it makes sense to set the scope to `Incremental` and select the `ingestion_timestamp` column as the `Timestamp` column.


###### _Filter rows_
You can specify a filter in the form of a SQL expression to filter rows based on your condition. This should be a valid expression for a WHERE clause in BigQuery standard SQL syntax. The filter is applied every time this scan runs on the source table.

Let's say the product team's requirement is to only consider data for `distribution center id` greater than 1, so our row filter condition can be `product_distribution_center_id > 1`.

You could also leverage row filters to filter out older data in your tables. This can be particularly useful if you have large tables with legacy data that is not particularly interesting from a monitoring perspective. Recall that the product team also wanted to ignore all inventory data created before 2019, so an additional row filter condition can be `created_at > TIMESTAMP('2018-12-31')`.

The final row filter condition will be `product_distribution_center_id > 1 AND created_at > TIMESTAMP('2018-12-31')`.


###### _Filter Columns_
Additionally, you can filter which columns are scanned by this profile scan. This is particularly useful if you have prior knowledge of which columns will be most interesting to scan.

For instance, in your case, you know that the `ingestion_timestamp` column is a required `Timestamp` column and will not provide any useful profiling information. You can filter out this column by adding it to the excluded-column list. Alternatively, you could list the columns you want profiled in the included-column list.

Here, we will exclude the column `ingestion_timestamp`, since we already know its values and are filtering on this column.


###### _Sampling size_
Another way to limit the data to be scanned is to specify a sampling size. If specified, the profile scan result is based on the sampled data. Sampling is applied after the two filters above.

Sampling is particularly useful if you expect a large amount of data to be seen in each scan. Specifying a smaller sampling size for such data reduces cost, while choosing a sampling size appropriate for the overall data volume per scan yields more accurate profile insights.

Since you only expect ~200 rows to be scanned every day, you can skip configuring a sampling size for this scan.


###### _Schedule_
You can either create an `On-demand` scan, which only runs when you explicitly run it, or specify a `Schedule` to run the scan regularly at a particular time.

Creating a `Repeat` scheduled scan allows you to automatically trigger a scan around a specific event such as data ingestion time. Since you expect the data to be ingested daily at 8 AM PDT, you can schedule the profile scan to run every day at 5 PM PDT. This enables you to gather insights from the data daily.


###### _Export scan results to BigQuery table_

In `Additional Settings`, you can specify the path to a BigQuery table to export your profile results for each scan for further analysis. This is particularly useful for building more advanced dashboards using Looker Studio, or for building downstream detection or forecasting systems that leverage BQML models on profile scan results.

For this example, let's assume you want to store results in the `datascan-test-project1.datascan_blog_examples.datascan_inventory_profile_results` table, but this table doesn't exist yet. You can specify the path to the dataset `datascan-test-project1.datascan_blog_examples` and give the table name as `datascan_inventory_profile_results`. If the table doesn't exist, it will be created.


A new one-line file (1 addition, 0 deletions):

This directory contains scripts and setup for datascan blog examples.
Binary files added:

- datascan/datascan-blog-examples/images/profile_scan_column_filter.png (+10.1 KB)
- datascan/datascan-blog-examples/images/profile_scan_export_results.png (+30.3 KB)
- datascan/datascan-blog-examples/images/profile_scan_sampling_size.png (+11.7 KB)
nit: since we are already in the datascan folder, we could shorten "datascan-blog-examples" to "blog-examples".