Replies: 1 comment
Thanks, Andrew!
I will try to take a closer look soon.
--
Evan Thoms
US Geological Survey, Alaska Science Center
4210 University Drive, Anchorage, AK 99508
________________________________
From: Andrew L. Wunderlich
Sent: Tuesday, November 28, 2023 11:23 AM
To: DOI-USGS/gems-tools-pro
Subject: SCRIPT TO SHARE: Replace SourceID values within a GeMS database AND its stand-alone metadata (txt or xml) (Discussion #91)
As our GeMS database development experience has increased and more databases are being produced, we have (inevitably) found that certain aspects of the production process have evolved. One thing that has changed from the early GeMS days is the way we handle SourceIDs. We used to use the simple DAS001, DAS002, DAS... style of SourceIDs, but they are problematic. First and foremost, they are not unique across databases, which, as we move toward an enterprise-type database to house all of our maps, will be a real headache. Second, this style of SourceIDs is not very meaningful when read by a human technician.
Thankfully, we were consistent with the SourceIDs across the different databases when it came to common sources, e.g., the TGS definition of a glossary term was always set to "DAS100". But "DAS100" is not a meaningful term, and we are now using "TGS_GLODEF" as the SourceID in this example. We also always used the convention "DAS001" for "This report" and followed with DAS002, DAS003, etc. for additional sources unique to the database. We are now using a more easily identifiable ID tied to the unique quadrangle ID for each map, e.g., "DAS001" becomes "TGSGQM_429NE_001" and so on.
So, we wanted to fix this issue before it got out of hand: go through the older databases and replace the old IDs with the new IDs. The problem comes when you need to find and replace every instance of a SourceID in every table in a database. And then eight other IDs (or 12, or 30). And then in multiple databases. And don't forget the metadata! Oh my! This can get out of hand very quickly...
Solution: a script based conceptually on the excellent Attribute by Key Values tool by Ralph Haugerud. The main idea taken from it is a text file with a list of key: value pairs to find and replace. The tool uses the "gdb_object_dict" GeMS utility function to inventory the database and a regular expression to find and replace the key: value pairs in the ENTIRE database AND the stand-alone metadata ALL AT ONCE with very little code. It can replace dozens of pairs of values in every table in a database and the metadata file in a matter of seconds...
This is not a polished product, so use at your own risk. But use it, test it, break it. I think it has potential to be a really powerful tool and a major time saver. The Python code for the script is followed by an example of the text file needed to supply the key: value pairs to the tool. Read the descriptive text in the code comments for more info on how to use it.
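The core of the approach described above can be sketched independently of arcpy: one compiled regex that matches any key as a whole word and replaces every occurrence in a single pass. The ID pairs below are hypothetical examples, not values from a real database:

```python
import re

# Hypothetical key: value pairs (the real ones come from the pairs text file)
replace_dict = {"DAS001": "TGSGQM_429NE_001", "DAS100": "TGS_GLODEF"}

# Build one pattern that matches any key as a whole word, so "DAS001Z"
# is left alone and the "DAS001" key cannot clobber part of "DAS0011"
pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in replace_dict) + r")\b")

def replace_ids(text):
    # One pass replaces every key in the string, even pipe-delimited lists of IDs
    return pattern.sub(lambda m: replace_dict[m.group()], text)

print(replace_ids("DAS001|DAS100"))  # TGSGQM_429NE_001|TGS_GLODEF
print(replace_ids("A|DAS001Z|B"))    # A|DAS001Z|B (no whole-word match, untouched)
```

Because `|` is a non-word character, the `\b` boundaries treat each ID in a pipe-delimited field as its own word, which is what lets one field like "A|C|B" be rewritten as "1|C|2".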
# GeMS_ReplaceSourceIDs.py
# Andrew L. Wunderlich
# ***@***.***
# 11/22/2023
"""
Steps through feature classes in a GeMS geodatabase and finds/replaces SourceIDs.
This tool has two main inputs, a GeMS-style geodatabase and a text file describing pairs of values to find/replace in
any (and ONLY in) GeMS-y SourceID fields in the database (e.g., DataSourceID, DescriptionSourceID, LocationSourceID,
etc.). It utilizes GeMS_utilityFunctions.py to inventory the database and uses the inventory to find any fields
ending in "SourceID". The user can specify whether values should also be replaced in the DataSources table
DataSources_ID field. Optionally, the user can choose to process a stand-alone metadata file as well, which helps
maintain continuity between SourceIDs used in the geodatabase and those listed in the metadata.
The replacement values file is a simple pipe-delimited ("|") text file listing pairs of values, the first being the
value to find, the second being its replacement, e.g., "DAS001|TGSGM_429SE_001". Each pair of values must be on its own
line and there must not be more than one delimiter (one pair of values) per line (A|B is OK, A|B|C will be skipped). See
the Resources folder for an example replacement values file and more information.
Within the geodatabase SourceID fields, each key is replaced with its corresponding value. The tool can handle features
that have multiple IDs in one field, e.g., key: value pairs A|1 and B|2 will replace a field value "A|B" with "1|2" or
"A|C|B" with "1|C|2". The find/replace uses regex to find and replace all key: value pairs in one pass for each row in
the table being processed and is set to match whole words only to prevent key substrings from replacing parts of other
keys, i.e., "DAS001|ABC-123" won't erroneously find/replace "DAS001Z" with "ABC-123Z".
If the option to replace values in a stand-alone metadata file is selected, the "Metadata file" field will become
visible in the tool dialog and the user can choose either a txt or xml metadata file to process. The logic for the
file is the same as for the database and uses the same regex command to do the find/replace. The structure of XML or
formatted text files is not affected by the read/write; files that were FGDC-compliant before processing returned no
errors when tested with USGS MP (https://www1.usgs.gov/mp/) after being modified by this tool.
This tool modifies the input geodatabase and the metadata file (if optioned). ALWAYS back up your
datasets before using a tool that modifies any of the inputs!!
Script setup for ArcGIS Pro toolbox:
Name: ReplaceSourceIDs
Label: Replace SourceIDs
Description: Replace SourceIDs in a GeMS-style geodatabase and optionally a stand-alone metadata
file (txt or xml) using a delimited text file to identify find/replace pairs.
Store tool with relative path: YES
ArcGIS Pro Tool Properties required Parameters:
    Input geodatabase - Workspace; Required; Input
    Replacement values file - File; Required; Input; Filter: File-txt
    Ignore DataSources_IDs - Boolean; Optional; Input
    Replace values in stand-alone metadata file - Boolean; Optional; Input
    Metadata file - File; Optional; Input; Filter: File-txt;xml
Paste the following code into the Validation parameters, replacing the default "def updateParameters(self)" section:
    def updateParameters(self):
        # Modify parameter values and properties.
        # This gets called each time a parameter is modified, before
        # standard validation.
        if str(self.params[3].value).lower() == "true":
            self.params[4].enabled = True
        else:
            self.params[4].enabled = False
            self.params[4].value = ""
        return
"""
import arcpy
import importlib
import re
import sys
import shutil
import os
import GeMS_utilityFunctions as GuF
# Usage string to print
usage = """
Usage: GeMS_ReplaceSourceIDs.py <Input geodatabase> <Replacement values file> <Ignore
DataSources_IDs> <Replace values in standalone metadata file> <Metadata file>
<Input geodatabase> is an GeMS-style geodatabase (modified by the tool).
<Replacement values file> is a formatted text file that specifies pairs of IDs:
old DAS IDs to search for and new DAS IDs to replace them separated by a
delimiter ("|" by default). Key (search) items must be unique otherwise the
last duplicate entry will be used by the tool. See ReplaceDASValues.txt in
the GeMS Resources folder for an example and format instructions.
<Ignore DataSources_IDs> boolean (True/False with or without quotes) that
determines if values in the DataSources table DataSources_ID field will be
searched and replaced (False) or not (True; default).
<Replace values in standalone metadata file> boolean (True/False with or
without quotes) that will determine if existing values in a stand alone
metadata file will also be searched and replaced (True) or not (False; default)
<Metadata file> is a formatted text or xml file of the FGDC metadata for the
geodatabase being processed. This file is modified by the tool.
"""
### This value is used to parse the input text file. Default is "|"
# Could be made into a sys.argv[] to allow for more robust find/replace
separator = "|"
### Gather info from user inputs
# Input geodatabase
gdb = sys.argv[1]
# Read raw text from text file (values to find and replace)
with open(sys.argv[2], "r", encoding="utf-8") as keyFile:
    keyLinesRaw = keyFile.readlines()
# GuF.addMsgAndPrint("Raw text read from input file: " + str(keyLinesRaw))
# Check if user specified to ignore DataSources_ID
noSource = sys.argv[3].lower() == "true"
# Check if user specified a metadata file to replace values
replaceMeta = sys.argv[4].lower() == "true"
# If true, get metadata file name
metaFile = sys.argv[5]
if replaceMeta:
    if len(metaFile) > 3:
        GuF.addMsgAndPrint("Metadata file: " + metaFile)
    else:
        GuF.addMsgAndPrint("Metadata file name is invalid: " + metaFile)
        replaceMeta = False
# Remove empty lines, leading/trailing spaces, and comments from keyLinesRaw
# Create an empty list to house the value pairs
keyLines = []
for lin in keyLinesRaw:
    lin = lin.strip()
    if len(lin) > 1 and lin[0:1] != "#":
        keyLines.append(lin)
# Prepare values to use for find-replace as a dictionary
# Create an empty dictionary to house the {key: value} pairs read from the text file
GuF.addMsgAndPrint("Reading input text file for find/replace pairs...")
replaceDict = {}
for keyLine in keyLines:
    # Separate the values using the separator specified above
    # Currently does not handle the delimiter being included in either key or value
    vals = keyLine.split(separator)
    if len(vals) == 2:
        for i in range(len(vals)):
            # Strip out quotes
            vals[i] = vals[i].replace("'", "")
            vals[i] = vals[i].replace('"', "")
            # Remove leading and trailing whitespace within each value
            vals[i] = vals[i].strip()
        # Add the vals list items to the dict as a key and value pair (k: v)
        replaceDict[vals[0]] = vals[1]
    else:
        GuF.addMsgAndPrint("Line:\n  {}\nhas wrong number of values. Skipping...".format(keyLine))
# Set the workspace
arcpy.env.workspace = gdb
# Inventory gdb and get the list of fields that end in "SourceID" or "Sources_ID"
d = GuF.gdb_object_dict(gdb)
# Inform the user of their choice to ignore the IDs in the DataSources table
if noSource:
    GuF.addMsgAndPrint("Ignoring DataSources_ID values in DataSources table")
else:
    GuF.addMsgAndPrint("DataSources_ID values in DataSources table will be replaced")
# Compile the find/replace pattern once; replaceDict does not change, so there
# is no need to rebuild the pattern for every row. This method is extremely
# efficient: it replaces instances of all keys in one pass instead of searching
# for each key one at a time. It also matches whole words only to reduce the
# chances of mistakes with keys that are substrings of another, e.g., DAS001
# and DAS0011.
keysGdb = (re.escape(g) for g in replaceDict.keys())
patternGdb = re.compile(r"\b(" + "|".join(keysGdb) + r")\b")
# Look through the database inventory and find all the SourceID fields, then use regex to find/replace
for k, v in d.items():
    if "fields" in v:
        fields = [f.name for f in v["fields"]]
        itemCount = 0
        for field in fields:
            # Always process fields ending in "SourceID"; process fields ending
            # in "Sources_ID" only if the user did not choose to ignore them
            if field.endswith("SourceID") or (not noSource and field.endswith("Sources_ID")):
                # DEBUG Print the feature class and name of field
                GuF.addMsgAndPrint("SourceID location: " + v["baseName"] + "; Field name: " + field)
                # Set the feature class to work with based on the field's catalogPath
                updateTable = v["catalogPath"]
                # Use an UpdateCursor to target the class and field with values to replace
                with arcpy.da.UpdateCursor(updateTable, [field]) as updateRows:
                    for updateRow in updateRows:
                        # Store the value of the row being searched in fieldValue variable
                        fieldValue = updateRow[0]
                        # Skip null or empty values; pattern.sub would fail on None
                        if not fieldValue:
                            continue
                        # Use regex to find replaceDict keys and replace with matching
                        # values; multiple IDs in one field (e.g., "A|B") are handled
                        resultGdb = patternGdb.sub(lambda u: replaceDict[u.group()], fieldValue)
                        if fieldValue != resultGdb:
                            # GuF.addMsgAndPrint("  Replaced field contents with: " + resultGdb)
                            itemCount = itemCount + 1
                            updateRow[0] = resultGdb
                            updateRows.updateRow(updateRow)
                GuF.addMsgAndPrint("  Updated values in " + str(itemCount) + " records")
# Open the user-specified metadata file and do the find/replace
if replaceMeta:
    try:
        GuF.addMsgAndPrint("Reading metadata file: " + os.path.split(metaFile)[1])
        with open(metaFile, "r", encoding="utf8") as mF:
            mFData = mF.read()
        # Replace the keys with matching values
        GuF.addMsgAndPrint("  Finding keys and replacing with values...")
        keysMeta = (re.escape(g) for g in replaceDict.keys())
        patternMeta = re.compile(r"\b(" + "|".join(keysMeta) + r")\b")
        resultMeta = patternMeta.sub(lambda u: replaceDict[u.group()], mFData)
        # Rewrite the metadata file
        GuF.addMsgAndPrint("  Writing updates to " + os.path.split(metaFile)[1])
        with open(metaFile, "w", encoding="utf8") as mF:
            mF.write(resultMeta)
        GuF.addMsgAndPrint("Metadata file updated successfully!")
    except FileNotFoundError:
        GuF.addMsgAndPrint("Invalid metadata file: " + metaFile)
Example of the text file (ReplaceSourceIDValues.txt):
# Example find/replace file for use with GeMS_ReplaceSourceIDs.py
#
# The value on the left is the search term (key). The value on
# the right is the replace term (value).
DAS100|NEW_DEF1
DAS101|NEW_DEF2
DAS102|NEW_DEF3
DAS107|FGDC-STD-013-2006
GSABUL129|GSA_BULL-129(7-8)
# NOTES:
# Keys must be unique in the list. Duplicate keys will result in the
# last instance of the key having its value assigned to the pair:
# AB|12
# AB|34 <- this pair will be kept
#
# Empty lines are OK (they are ignored).
# Lines that begin with # symbol are comments and are ignored.
# Don't put comments at the end of a line. # like this
# Within a line, values must be separated by delimiter. | is default.
# Lines with more than one delimiter will be ignored; they are
# interpreted as having an incorrect number of arguments.
#
# Avoid punctuation within a key or value. The script uses regex to
# do the substitutions, so periods, semicolons, etc. could produce
# unexpected/undesirable results when searching for whole words.
#
# Leading and trailing whitespace (spaces, tabs) between values and
# delimiters is ignored: "AB | 12" searches for "AB" and replaces
# it with "12".
#
# Spaces within a key or value are OK: "AB CD|12 34"
# If you need to replace a value that contains the default delimiter,
# change the tool delimiter, e.g., to replace "AB|12" with "CD",
# edit the script delimiter to "," and make the pair "AB|12,CD"
#
# If you have a list of {key: value} pairs in a spreadsheet,
# save it in csv format, specifying the | character as the delimiter
# or another delimiter if keys or values contain the default
# delimiter.
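The parsing rules listed above can be sketched in a few lines. The sample lines below are hypothetical stand-ins for the contents of a pairs file, and the loop mirrors the script's handling of comments, blank lines, quotes, whitespace, and bad lines:

```python
# Hypothetical sample lines standing in for the contents of the pairs file
raw_lines = [
    "# comment line (ignored)",
    "",                      # empty line (ignored)
    " DAS100 | NEW_DEF1 ",   # whitespace around values is stripped
    '"DAS101"|NEW_DEF2',     # quotes are stripped
    "A|B|C",                 # more than one delimiter: skipped
]

separator = "|"
replace_dict = {}
for line in raw_lines:
    line = line.strip()
    if len(line) <= 1 or line.startswith("#"):
        continue  # skip blank lines and comments
    vals = line.split(separator)
    if len(vals) != 2:
        continue  # exactly one delimiter (one pair of values) per line
    key, value = [v.replace("'", "").replace('"', "").strip() for v in vals]
    replace_dict[key] = value

print(replace_dict)  # {'DAS100': 'NEW_DEF1', 'DAS101': 'NEW_DEF2'}
```

Note that a duplicate key in the input would simply overwrite the earlier entry in the dict, which is why the last instance of a duplicate key wins.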