Replies: 1 comment
Thanks, Andrew!
I will try to take a closer look soon.
--
Evan Thoms
US Geological Survey, Alaska Science Center
4210 University Drive, Anchorage, AK 99508
________________________________
From: Andrew L. Wunderlich
Sent: Tuesday, November 28, 2023 11:23 AM
To: DOI-USGS/gems-tools-pro
Subject: SCRIPT TO SHARE: Replace SourceID values within a GeMS database AND its stand-alone metadata (txt or xml) (Discussion #91)
As our GeMS database development experience has increased and more databases are being produced, we have (inevitably) found that certain aspects of the production process have evolved. One thing that has changed from the early GeMS days is the way we handle SourceIDs. We used to use the simple DAS001, DAS002, DAS... style of SourceIDs, but they are problematic. First and foremost, they are not unique across databases, which, as we move toward an enterprise-type database to house all of our maps, will be a real headache. Second, this style of SourceIDs is not very meaningful when read by a human technician.
Thankfully, we were consistent with the SourceIDs across the different databases when it came to common sources, e.g., the TGS definition of a glossary term was always set to "DAS100". But "DAS100" is not a meaningful term, and we are now using "TGS_GLODEF" as the SourceID in this example. We also always used the convention "DAS001" for "This report" and followed with DAS002, DAS003, etc. for additional sources unique to the database. We are now using a more easily identifiable ID tied to the unique quadrangle ID for each map, e.g., "DAS001" becomes "TGSGQM_429NE_001" and so on.
So, we wanted to fix this issue before it got out of hand: go through the older databases and replace the old IDs with the new IDs. The problem comes when you need to find and replace every instance of a SourceID in every table in a database. And then eight other IDs (or 12, or 30). And then in multiple databases. And don't forget the metadata! Oh my! This can get out of hand very quickly...
Solution: a script based conceptually on the excellent Attribute by Key Values tool by Ralph Haugerud. The main idea taken from it is a text file with a list of key: value pairs to find and replace. The tool uses the "gdb_object_dict" GeMS utility function to inventory the database and a regular expression to find and replace the key: value pairs in the ENTIRE database AND the stand-alone metadata ALL AT ONCE with very little code. It can replace dozens of pairs of values in every table in a database and the metadata file in a matter of seconds...
This is not a polished product, so use at your own risk. But use it, test it, break it. I think it has potential to be a really powerful tool and a major time saver. The Python code for the script is followed by an example of the text file needed to supply the key: value pairs to the tool. Read the descriptive text in the code comments for more info on how to use it.
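The core of the approach described above can be sketched independently of arcpy: one compiled regex that matches any key as a whole word and replaces every occurrence in a single pass. The ID pairs below are hypothetical examples, not values from a real database:

```python
import re

# Hypothetical key: value pairs (the real ones come from the pairs text file)
replace_dict = {"DAS001": "TGSGQM_429NE_001", "DAS100": "TGS_GLODEF"}

# Build one pattern that matches any key as a whole word, so "DAS001Z"
# is left alone and the "DAS001" key cannot clobber part of "DAS0011"
pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in replace_dict) + r")\b")

def replace_ids(text):
    # One pass replaces every key in the string, even pipe-delimited lists of IDs
    return pattern.sub(lambda m: replace_dict[m.group()], text)

print(replace_ids("DAS001|DAS100"))  # TGSGQM_429NE_001|TGS_GLODEF
print(replace_ids("A|DAS001Z|B"))    # A|DAS001Z|B (no whole-word match, untouched)
```

Because `|` is a non-word character, the `\b` boundaries treat each ID in a pipe-delimited field as its own word, which is what lets one field like "A|C|B" be rewritten as "1|C|2".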
# GeMS_ReplaceSourceIDs.py
# Andrew L. Wunderlich
# ***@***.***
# 11/22/2023
"""
Steps through feature classes in a GeMS geodatabase and finds/replaces SourceIDs.
This tool has two main inputs, a GeMS-style geodatabase and a text file describing pairs of values to find/replace in
any (and ONLY in) GeMS-y SourceID fields in the database (e.g., DataSourceID, DescriptionSourceID, LocationSourceID,
etc.). It utilizes GeMS_utilityFunctions.py to inventory the database and uses the inventory to find any fields
ending in "SourceID". The user can specify whether values should also be replaced in the DataSources table
DataSources_ID field. Optionally, the user can choose to process a stand-alone metadata file as well, which helps
maintain continuity between SourceIDs used in the geodatabase and those listed in the metadata.
The replacement values file is a simple pipe-delimited ("|") text file listing pairs of values, the first being the
value to find, the second being its replacement, e.g., "DAS001|TGSGM_429SE_001". Each pair of values must be on its own
line and there must not be more than one delimiter (one pair of values) per line (A|B is OK, A|B|C will be skipped). See
the Resources folder for an example replacement values file and more information.
Within the geodatabase SourceID fields, each key is replaced with its corresponding value. The tool can handle features
that have multiple IDs in one field, e.g., key: value pairs A|1 and B|2 will replace a field value "A|B" with "1|2" or
"A|C|B" with "1|C|2". The find/replace uses regex to find and replace all key: value pairs in one pass for each row in
the table being processed and is set to match whole words only to prevent key substrings from replacing parts of other
keys, i.e., "DAS001|ABC-123" won't erroneously find/replace "DAS001Z" with "ABC-123Z".
If the option to replace values in a stand-alone metadata file is selected, the "Metadata file" field will become
visible in the tool dialog and the user can choose either a txt or xml metadata file to process. The logic for the
file is the same as for the database and uses the same regex command to do the find/replace. The structure of XML or
formatted text files is not affected by the read/write; files that were FGDC-compliant before processing returned no
errors when tested with USGS MP (https://www1.usgs.gov/mp/) after being modified by this tool.
This tool modifies the input geodatabase and the metadata file (if optioned). ALWAYS back up your
datasets before using a tool that modifies any of the inputs!!
Script setup for ArcGIS Pro toolbox:
Name: ReplaceSourceIDs
Label: Replace SourceIDs
Description: Replace SourceIDs in a GeMS-style geodatabase and optionally a stand-alone metadata
file (txt or xml) using a delimited text file to identify find/replace pairs.
Store tool with relative path: YES
ArcGIS Pro Tool Properties required Parameters:
    Input geodatabase - Workspace; Required; Input
    Replacement values file - File; Required; Input; Filter: File-txt
    Ignore DataSources_IDs - Boolean; Optional; Input
    Replace values in stand-alone metadata file - Boolean; Optional; Input
    Metadata file - File; Optional; Input; Filter: File-txt;xml
Paste the following code into the Validation parameters, replacing the default "def updateParameters(self)" section:
    def updateParameters(self):
        # Modify parameter values and properties.
        # This gets called each time a parameter is modified, before
        # standard validation.
        if str(self.params[3].value).lower() == "true":
            self.params[4].enabled = True
        else:
            self.params[4].enabled = False
            self.params[4].value = ""
        return
"""
import arcpy
import importlib
import re
import sys
import shutil
import os
import GeMS_utilityFunctions as GuF
# Usage string to print
usage = """
Usage: GeMS_ReplaceSourceIDs.py <Input geodatabase> <Replacement values file> <Ignore
DataSources_IDs> <Replace values in standalone metadata file> <Metadata file>
<Input geodatabase> is an GeMS-style geodatabase (modified by the tool).
<Replacement values file> is a formatted text file that specifies pairs of IDs:
old DAS IDs to search for and new DAS IDs to replace them separated by a
delimiter ("|" by default). Key (search) items must be unique otherwise the
last duplicate entry will be used by the tool. See ReplaceDASValues.txt in
the GeMS Resources folder for an example and format instructions.
<Ignore DataSources_IDs> boolean (True/False with or without quotes) that
determines if values in the DataSources table DataSources_ID field will be
searched and replaced (False) or not (True; default).
<Replace values in standalone metadata file> boolean (True/False with or
without quotes) that will determine if existing values in a stand alone
metadata file will also be searched and replaced (True) or not (False; default)
<Metadata file> is a formatted text or xml file of the FGDC metadata for the
geodatabase being processed. This file is modified by the tool.
"""
### This value is used to parse the input text file. Default is "|"
# Could be made into a sys.argv[] to allow for more robust find/replace
separator = "|"
### Gather info from user inputs
# Input geodatabase
gdb = sys.argv[1]
# Read raw text from text file (values to find and replace)
with open(sys.argv[2], "r", encoding="utf-8") as keyFile:
    keyLinesRaw = keyFile.readlines()
# GuF.addMsgAndPrint("Raw text read from input file: " + str(keyLinesRaw))
# Check if user specified to ignore DataSources_ID
noSource = sys.argv[3].lower() == "true"
# Check if user specified a metadata file to replace values
replaceMeta = sys.argv[4].lower() == "true"
# If true, get metadata file name
metaFile = sys.argv[5]
if replaceMeta:
    if len(metaFile) > 3:
        GuF.addMsgAndPrint("Metadata file: " + metaFile)
    else:
        GuF.addMsgAndPrint("Metadata file name is invalid: " + metaFile)
        replaceMeta = False
# Remove empty lines, leading/trailing spaces, and comments from keyLinesRaw
# Create an empty list to house the value pairs
keyLines = []
for lin in keyLinesRaw:
    lin = lin.strip()
    if len(lin) > 1 and lin[0:1] != "#":
        keyLines.append(lin)
# Prepare values to use for find-replace as a dictionary
# Create an empty dictionary to house the {key: value} pairs read from the text file
GuF.addMsgAndPrint("Reading input text file for find/replace pairs...")
replaceDict = {}
for keyLine in keyLines:
    # Separate the values using the separator specified above
    # Currently does not handle the delimiter being included in either key or value
    vals = keyLine.split(separator)
    if len(vals) == 2:
        for i in range(len(vals)):
            # Strip out quotes
            vals[i] = vals[i].replace("'", "")
            vals[i] = vals[i].replace('"', "")
            # Remove leading and trailing whitespace within each value
            vals[i] = vals[i].strip()
        # Add the vals list items to the dict as a key and value pair (k: v)
        replaceDict[vals[0]] = vals[1]
    else:
        GuF.addMsgAndPrint("Line:\n  {}\nhas wrong number of values. Skipping...".format(keyLine))
# Set the workspace
arcpy.env.workspace = gdb
# Inventory gdb and get the list of fields that end in "SourceID" or "Sources_ID"
d = GuF.gdb_object_dict(gdb)
# Inform the user of their choice to ignore the IDs in the DataSources table
if noSource:
    GuF.addMsgAndPrint("Ignoring DataSources_ID values in DataSources table")
else:
    GuF.addMsgAndPrint("DataSources_ID values in DataSources table will be replaced")
# Compile the find/replace pattern once; replaceDict does not change, so there
# is no need to rebuild the pattern for every row. This method is extremely
# efficient: it replaces instances of all keys in one pass instead of searching
# for each key one at a time. It also matches whole words only to reduce the
# chances of mistakes with keys that are substrings of another, e.g., DAS001
# and DAS0011.
keysGdb = (re.escape(g) for g in replaceDict.keys())
patternGdb = re.compile(r"\b(" + "|".join(keysGdb) + r")\b")
# Look through the database inventory and find all the SourceID fields, then use regex to find/replace
for k, v in d.items():
    if "fields" in v:
        fields = [f.name for f in v["fields"]]
        itemCount = 0
        for field in fields:
            # Always process fields ending in "SourceID"; process fields ending
            # in "Sources_ID" only if the user did not choose to ignore them
            if field.endswith("SourceID") or (not noSource and field.endswith("Sources_ID")):
                # DEBUG Print the feature class and name of field
                GuF.addMsgAndPrint("SourceID location: " + v["baseName"] + "; Field name: " + field)
                # Set the feature class to work with based on the field's catalogPath
                updateTable = v["catalogPath"]
                # Use an UpdateCursor to target the class and field with values to replace
                with arcpy.da.UpdateCursor(updateTable, [field]) as updateRows:
                    for updateRow in updateRows:
                        # Store the value of the row being searched in fieldValue variable
                        fieldValue = updateRow[0]
                        # Skip null or empty values; pattern.sub would fail on None
                        if not fieldValue:
                            continue
                        # Use regex to find replaceDict keys and replace with matching
                        # values; multiple IDs in one field (e.g., "A|B") are handled
                        resultGdb = patternGdb.sub(lambda u: replaceDict[u.group()], fieldValue)
                        if fieldValue != resultGdb:
                            # GuF.addMsgAndPrint("  Replaced field contents with: " + resultGdb)
                            itemCount = itemCount + 1
                            updateRow[0] = resultGdb
                            updateRows.updateRow(updateRow)
                GuF.addMsgAndPrint("  Updated values in " + str(itemCount) + " records")
# Open the user-specified metadata file and do the find/replace
if replaceMeta:
    try:
        GuF.addMsgAndPrint("Reading metadata file: " + os.path.split(metaFile)[1])
        with open(metaFile, "r", encoding="utf8") as mF:
            mFData = mF.read()
        # Replace the keys with matching values
        GuF.addMsgAndPrint("  Finding keys and replacing with values...")
        keysMeta = (re.escape(g) for g in replaceDict.keys())
        patternMeta = re.compile(r"\b(" + "|".join(keysMeta) + r")\b")
        resultMeta = patternMeta.sub(lambda u: replaceDict[u.group()], mFData)
        # Rewrite the metadata file
        GuF.addMsgAndPrint("  Writing updates to " + os.path.split(metaFile)[1])
        with open(metaFile, "w", encoding="utf8") as mF:
            mF.write(resultMeta)
        GuF.addMsgAndPrint("Metadata file updated successfully!")
    except FileNotFoundError:
        GuF.addMsgAndPrint("Invalid metadata file: " + metaFile)
Example of the text file (ReplaceSourceIDValues.txt):
# Example find/replace file for use with GeMS_ReplaceSourceIDs.py
#
# The value on the left is the search term (key). The value on
# the right is the replace term (value).
DAS100|NEW_DEF1
DAS101|NEW_DEF2
DAS102|NEW_DEF3
DAS107|FGDC-STD-013-2006
GSABUL129|GSA_BULL-129(7-8)
# NOTES:
# Keys must be unique in the list. Duplicate keys will result in the
# last instance of the key having its value assigned to the pair:
# AB|12
# AB|34 <- this pair will be kept
#
# Empty lines are OK (they are ignored).
# Lines that begin with # symbol are comments and are ignored.
# Don't put comments at the end of a line. # like this
# Within a line, values must be separated by delimiter. | is default.
# Lines with more than one delimiter will be ignored; they are
# interpreted as having an incorrect number of arguments.
#
# Avoid punctuation within a key or value. The script uses regex to
# do the substitutions, so periods, semicolons, etc. could produce
# unexpected/undesirable results when searching for whole words.
#
# Leading and trailing whitespace (spaces, tabs) between values and
# delimiters is ignored: "AB | 12" searches for "AB" and replaces
# it with "12".
#
# Spaces within a key or value are OK: "AB CD|12 34"
# If you need to replace a value that contains the default delimiter,
# change the tool delimiter, e.g., to replace "AB|12" with "CD",
# edit the script delimiter to "," and make the pair "AB|12,CD"
#
# If you have a list of {key: value} pairs in a spreadsheet,
# save it in csv format, specifying the | character as the delimiter
# or another delimiter if keys or values contain the default
# delimiter.
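The parsing rules listed above can be sketched in a few lines. The sample lines below are hypothetical stand-ins for the contents of a pairs file, and the loop mirrors the script's handling of comments, blank lines, quotes, whitespace, and bad lines:

```python
# Hypothetical sample lines standing in for the contents of the pairs file
raw_lines = [
    "# comment line (ignored)",
    "",                      # empty line (ignored)
    " DAS100 | NEW_DEF1 ",   # whitespace around values is stripped
    '"DAS101"|NEW_DEF2',     # quotes are stripped
    "A|B|C",                 # more than one delimiter: skipped
]

separator = "|"
replace_dict = {}
for line in raw_lines:
    line = line.strip()
    if len(line) <= 1 or line.startswith("#"):
        continue  # skip blank lines and comments
    vals = line.split(separator)
    if len(vals) != 2:
        continue  # exactly one delimiter (one pair of values) per line
    key, value = [v.replace("'", "").replace('"', "").strip() for v in vals]
    replace_dict[key] = value

print(replace_dict)  # {'DAS100': 'NEW_DEF1', 'DAS101': 'NEW_DEF2'}
```

Note that a duplicate key in the input would simply overwrite the earlier entry in the dict, which is why the last instance of a duplicate key wins.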