-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CUMULUS-3368]: [User Contribution] Introducing allFilesPresent metadata field for Cumulus Collections. Fixes Issue #3193 #3492
base: master
Are you sure you want to change the base?
Changes from 5 commits
14064d1
0b481a4
4f862d4
20235b1
bb2e185
0118baf
73bc21b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -48,6 +48,10 @@ A list of buckets with types that will be used to assign bucket targets based on | |||||
|
||||||
A Cumulus [collection](https://github.com/nasa/cumulus/blob/master/packages/api/lib/schemas.js) object. Used to define granule file groupings and granule metadata for discovered files. The collection object utilizes the collection type key to generate types in the output object on discovery. | ||||||
|
||||||
##### Collection Meta | ||||||
|
||||||
If the collection is configured with `collection.meta.allFilesPresent` set to `true`, the task will remove granules's missing files. Otherwise, the default behavior ignores missing files in granules. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The wording above is still confusing. Do you mean something like the following?
|
||||||
|
||||||
#### DuplicateGranuleHandling | ||||||
|
||||||
A string configuration that configures the step to filter the granules discovered: | ||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -225,6 +225,62 @@ const checkGranuleHasNoDuplicate = async (granuleId, duplicateHandling) => { | |
throw new Error(`Unexpected return from Private API lambda: ${JSON.stringify(response)}`); | ||
}; | ||
|
||
/** | ||
* Checks if all files from a collection are present for files from a granuleId | ||
* | ||
* | ||
* @param {Object} config - the event config | ||
* @param {Object} filesByGranuleId - Object with granuleId for keys with an array of | ||
* matching files for each | ||
* @param {string} granuleId - the Id of a granule | ||
* @returns {Boolean} - returns true if all file in collection config are present for granuleId | ||
*/ | ||
const checkGranuleHasAllFiles = (config, filesByGranuleId, granuleId) => { | ||
const granuleHasAllFiles = filesByGranuleId[granuleId].length >= config.collection.files.length; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If collection.files[].regex matchs multiple granule files, how do we make sure each regex has at least one granule file? The assertion is weak. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A granule can have 10 files but only 5 collection.files configuration. |
||
return granuleHasAllFiles; | ||
}; | ||
|
||
/** | ||
* Filters out granules that are missing files from a collection | ||
* | ||
* | ||
* @param {Object} params.filesByGranuleId - Object with granuleId for keys with an array of | ||
* matching files for each | ||
* @param {Object} params.config - the event config | ||
* @returns {Array.string} - returns granuleIds that contain all files from a collection | ||
*/ | ||
const filterGranulesWithoutAllFiles = ({ filesByGranuleId, config }) => { | ||
const checkResults = Object.keys(filesByGranuleId).filter( | ||
checkGranuleHasAllFiles.bind(this, config, filesByGranuleId) | ||
); | ||
return checkResults; | ||
}; | ||
|
||
/** | ||
* Handles granules without all files present according to the allFilesPresent boolean | ||
* | ||
* allFilesPresent = true : granules missing files will be removed | ||
* allFilesPresent = false: ignore missing files in granules | ||
* | ||
* @param {Object} params.filesByGranuleId - Object with granuleId for keys with an array of | ||
* matching files for each | ||
* @param {Boolean} params.allFilesPresent - boolean that defines whether or not all files should | ||
* be present in a granule before it can be discovered | ||
* @param {Object} params.config - the event config | ||
* @returns {Object} returns filesByGranuleId with all the granules missing files removed | ||
*/ | ||
const handleGranulesWithoutAllFilesPresent = ({ filesByGranuleId, allFilesPresent, config }) => { | ||
if (!allFilesPresent) { | ||
return filesByGranuleId; | ||
} | ||
const filteredKeys = filterGranulesWithoutAllFiles({ | ||
granuleIds: Object.keys(filesByGranuleId), | ||
filesByGranuleId, | ||
config, | ||
}); | ||
return pick(filesByGranuleId, filteredKeys); | ||
}; | ||
|
||
/** | ||
* Filters granule duplicates from a list of granuleIds according to the | ||
* configuration in duplicateHandling: | ||
|
@@ -314,6 +370,16 @@ const discoverGranules = async ({ config }) => { | |
concurrency: get(config, 'concurrency', 3), | ||
}); | ||
|
||
let allFilesPresent = false; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Suggest to move line 373-376 to handleGranulesWithoutAllFilesPresent function There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
if (config.collection.meta) { | ||
allFilesPresent = config.collection.meta.allFilesPresent || false; | ||
} | ||
filesByGranuleId = handleGranulesWithoutAllFilesPresent({ | ||
filesByGranuleId, | ||
allFilesPresent, | ||
config, | ||
}); | ||
|
||
const discoveredGranules = map(filesByGranuleId, buildGranule(config)); | ||
|
||
logger({ granules: discoveredGranules.map((g) => g.granuleId) }).info(`Discovered ${discoveredGranules.length} granules.`); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since
allFilesPresent
is used in our task, suggest to add it to https://github.com/nasa/cumulus/blob/master/packages/api/lib/schemas.js#L211There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, please add the following to collection and granule mapping:
https://github.com/nasa/cumulus/blob/master/packages/es-client/config/mappings.json
"meta": {
"type": "object",
"dynamic": false
},
There is a ticket CUMULUS-3417 about mapping issue, and it can happen for collection record as well