Update facet id generation to match mixer#485
Update facet id generation to match mixer#485n-h-diaz wants to merge 4 commits intodatacommonsorg:masterfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses the need for stable facet IDs across different systems by aligning the facet ID generation in the new ingestion pipeline with the established logic in the Mixer backend. This change ensures that internal teams relying on consistent facet IDs after migrating to Spanner will experience seamless data integration and continuity. Highlights
Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request successfully updates the facet ID generation logic to align with the existing Mixer Go implementation. The introduction of the DataUtils class with the FNV-1a hash algorithm is a clear and direct approach to achieve consistency. The changes in Observation.java correctly integrate this new utility, and the addition of DataUtilsTest.java provides good test coverage for the new functionality. The updates to CacheReaderTest.java ensure that existing tests reflect the new ID generation. The comment regarding method parameter count is a valid general best practice for Java and has been retained. Overall, the changes are well-implemented and address the stated objective.
pipeline/data/src/main/java/org/datacommons/ingestion/data/DataUtils.java
Outdated
Show resolved
Hide resolved
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request aims to update the facet ID generation to align with an existing Go implementation in Mixer. While the intent is clear, I've identified two critical issues. Firstly, the new Java implementation for facet ID generation in DataUtils.java does not correctly replicate the Go logic, as it introduces separators where there should be none, which will lead to different hash results and defeats the purpose of this change. Secondly, in Observation.java, the hashCode() method has been removed while the equals() method is retained. This violates the fundamental Java Object contract and will cause problems when Observation objects are used in hash-based collections. I have provided detailed comments and suggestions for both issues.
pipeline/data/src/main/java/org/datacommons/ingestion/data/DataUtils.java
Show resolved
Hide resolved
pipeline/data/src/main/java/org/datacommons/ingestion/data/Observation.java
Show resolved
Hide resolved
Can you add some context on who is requesting this and why? |
This request is from NL2SQL team since they use allowlisted facet ids (cc @gmechali who has more context) |
Let's discuss this. We want to avoid this level of coupling since it makes it difficult to make updates in the future. |
Some internal teams using DC depend on stable facet ids after migrating to spanner and requested that we make them consistent with our existing backend
So this PR replicates the facet id generation currently in mixer in the new ingestion
https://github.com/datacommonsorg/mixer/blob/0618c1f3ef80703c98fc97f6c6c6e5cd3d7c13d3/internal/util/util.go#L497-L515