This script edits a column in an ORC file. It replaces the column's value either based on a regex or by replacing values taken from a list of values, and then saves the ORC file to a new location, partitioned by the same column as the original.
Values to change before running, as per the requirement:
- basepath of orc file: the base path where the ORC file to be edited is located
- column_name: the name of the column to be edited
- value to be replaced: the current value of the column that should be replaced
- replacement value: the new value that replaces the value in point #3
- basepath where orc file to be written: the base path where the edited ORC file is to be written
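The way these values fit together can be sketched as follows. This is a minimal illustration, not the script's actual code: the function names (`partition_columns`, `replace_by_regex`, `replace_from_list`, `edit_orc`) and their parameters are hypothetical, and "replacing from a list of values" is assumed to mean "replace any value that appears in a given list". It also assumes the ORC data is laid out in Hive-style `column=value` partition directories.

```python
def partition_columns(file_path, base_path):
    """Return the partition column names found in the Hive-style
    'column=value' directories between base_path and the file name.
    Assumes file_path starts with base_path."""
    relative = file_path[len(base_path):].strip("/")
    return [part.split("=", 1)[0] for part in relative.split("/") if "=" in part]


def replace_by_regex(df, column_name, pattern, replacement):
    """Replace every match of `pattern` (a Java regex) in the column."""
    # Imported inside the function so partition_columns() works without Spark installed.
    from pyspark.sql import functions as F
    return df.withColumn(
        column_name,
        F.regexp_replace(F.col(column_name), pattern, replacement),
    )


def replace_from_list(df, column_name, old_values, replacement):
    """Replace the column value when it appears in `old_values` (assumed meaning)."""
    from pyspark.sql import functions as F
    col = F.col(column_name)
    return df.withColumn(
        column_name,
        F.when(col.isin(old_values), F.lit(replacement)).otherwise(col),
    )


def edit_orc(spark, input_base, output_base, column_name,
             pattern, replacement, partition_cols):
    """Read the ORC data, edit one column, and write it to a new base path."""
    # Reading the base path lets Spark discover the partition columns.
    df = spark.read.orc(input_base)
    edited = replace_by_regex(df, column_name, pattern, replacement)
    # Writing with partitionBy recreates the column=value directory layout.
    edited.write.partitionBy(*partition_cols).orc(output_base)
```

Given an existing `SparkSession`, `edit_orc` would be called with the five values listed above plus the partition column names, which `partition_columns` can derive from the path of any one data file under the input base path.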
Software Dependencies:
- Spark 2.2 or above
- Python 2.7 or above
##################################################################################
Use the following command to submit the Spark job:
spark2-submit --master yarn --deploy-mode client --executor-memory=4g --num-executors=3 --executor-cores=2 --driver-memory=2g orc_filre_edit.py