Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AutoPR for paper information correction #4147

Merged
merged 106 commits into from
Dec 27, 2024
Merged

AutoPR for paper information correction #4147

merged 106 commits into from
Dec 27, 2024

Conversation

yufei118liu
Copy link
Collaborator

@yufei118liu yufei118liu commented Dec 11, 2024

This PR implements a semi-automated bulk processing of metadata corrections. We hope that it will vastly simplify the metadata correction experience for both users and Anthology staff. It includes the following pieces:

  • A link is added to each paper page which, when clicked, displays a modal dialog allowing for editing of the paper's title, author list, and abstract
  • Submitting the dialog takes the user to a Github issue template populated with the changed information in a JSON data block
  • A new script, process_bulk_metadata.py, processes these work items, makes the corrections, bundles them into a branch, and creates a unified PR

TODO items:

  • Replace the direct link with a dialog that allows for UI manipulation of the data
  • Finish the bulk processing script (may finish this in another PR so this can go live)
  • Create an LLM bot that attempts to validate each PR by checking whether the proposed changes match the PDF
  • Figure out why the text on the yellow "Fix metadata" button is white, not black
  • Close old metadata change requests with a note to use the new format (Clean up metadata issues #4175)

yufei118liu and others added 3 commits December 10, 2024 14:22
This change contains the following features:

On each paper page, a collapsible is added at the bottom, with the indication "File a correction". When clicked on, it shows a form prefilled current of the paper
Users can modify these information and click submit
After submission, the user is taken to a window that creates pull request to the repository. The effect of the pr is to add file containing these new information that the user entered. The path to the file and the branch that the pr is aiming can be specified.
This merge contains the following features:

On each paper page, a collapsible is added at the bottom, with the indication "File a correction". When clicked on, it shows a form prefilled current of the paper
Users can modify these information and click submit
After submission, the user is taken to a window that creates pull request to the repository. The effect of the pr is to add file containing these new information that the user entered. The path to the file and the branch that the pr is aiming can be specified.
@mjpost
Copy link
Member

mjpost commented Dec 11, 2024

Thanks for starting this off! We can deal with the UI elements once we get the functionality working. I see that this creates a PR against a Markdown file. However, the PR needs instead to be created against the relevant portion of the the XML files. For example, if I edit K17-1003, it should create a PR against the file data/xml/K17.xml, for the <paper id="3"> block. Does that make sense?

@mjpost mjpost added this to the 2024Q4 milestone Dec 11, 2024
@mjpost mjpost requested a review from Copilot December 11, 2024 16:47

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 1 out of 3 changed files in this pull request and generated no suggestions.

Files not reviewed (2)
  • hugo/layouts/papers/single.html: Language not supported
  • hugo/layouts/partials/correction.html: Language not supported
@mjpost
Copy link
Member

mjpost commented Dec 11, 2024

At the moment, it seems to be trying to create a Markdown file, e.g., as pictured. But this Markdown file is not part of the repository; instead, it is an artifact of the build process, created by running create_hugo_yaml.py. We need to edit he XML itself.

The trick here is whether a pull request can be automated. If we can find a way to (a) grab the current relevant XML file (b) modify it according to the parameters and (c) create a PR with the new file, that would be sufficient. I suggest we take these one by one.

image

@yufei118liu
Copy link
Collaborator Author

yufei118liu commented Dec 11, 2024

Yes it makes sense. However, I have not yet found a solution to locate and edit a sepecific section of a file in this way. My current vision is that we can put all such .md files in a specific directory, and use a python script to parse the .xml files and execute the change regularly.

@mjpost
Copy link
Member

mjpost commented Dec 11, 2024

Yes, that makes sense, this could be a workaround for that (though note that the information is not all passed on correctly at the moment).

Can you summarize the research you've done about automating PRs? What is possible in this area? For example, we know that we can do the following in Javascript:

  • Identify the correct XML file in the ACL repo
  • Modify its DOM in the correct location

Would it be possible, then to create a PR, replacing the existing file with the new one?

Copy link

github-actions bot commented Dec 11, 2024

Build successful. Some useful links:

This preview will be removed when the branch is merged.

@yufei118liu
Copy link
Collaborator Author

  1. The current version (with a tiny change) can already identify the .xml file and bring the user to the modifying window.
  2. Modifying/parsing can be done through a script. However, the main challenge is that creating a pr this way request authentification, which I'm not sure can be done when the user visit the website. I'll look more into this.
  3. Replacing the existing file with a new one would be suboptimal, since every .xml file are long, and in the current method, the content of the new file are converted to string and passed as a parameter.

hugo/layouts/papers/single.html Show resolved Hide resolved
hugo/layouts/partials/correction.html Outdated Show resolved Hide resolved
Comment on lines 16 to 29
<label for="paper_author">Paper authors: </label>
{{ range $index, $element := .paper.author }}
{{ if $index }}
<label for="paper_author_first">First Name: </label>
<input type="text" id="paper_author_first_{{$index}}" name="paper_author_first_{{$index}}" value="{{ $element.first }}"><br><br>

<label for="paper_author_last">Last Name: </label>
<input type="text" id="paper_author_last_{{$index}}" name="paper_author_last_{{$index}}" value="{{ $element.last }}"><br><br>

<!-- Is full name always first + last name? -->
<!-- <label for="paper_author_full">Full Name: </label>
<input type="text" id="paper_author_full_{{$index}}" name="paper_author_full_{{$index}}" value="{{ $element.full }}"><br><br> -->
{{ end }}
{{ end }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two potential complications:

  1. What do we do with names that require disambiguation through author IDs? We probably don't want to expose the complexity of author IDs to end users...
  2. I understand that this is a prototype/proof-of-concept, but just a note that we’d ultimately need functionality to remove, add, and ideally reorder author fields as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ID could be a hidden field embedded into the web page. As for the aesthetics and design, I have a new mockup I'll post below.

@mbollmann
Copy link
Member

Cool that you're looking into concrete ideas for this!

Regarding the .md files: I would definitely not want to add temporary Markdown files to the repository, if this is the suggestion. It clutters the commit history and I’m not sure it provides any benefit over simply opening a new issue containing this information. After all, we could also run a Python script that interacts with the Github API to read and parse issues, and then modifies the XML accordingly.

I also feel that requiring the user suggesting changes to authenticate is a good thing, as it's a potential hurdle for abuse of this system. We shouldn't discount the security implications of such automation.

Finally, if we need to find the line in an XML referring to a specific paper, there's always the option to embed that information into the website at build time. (Although this may run the risk that the information is outdated by the time the user clicks on the correction button.)

@mbollmann
Copy link
Member

After all, we could also run a Python script that interacts with the Github API to read and parse issues, and then modifies the XML accordingly.

To add: This was kinda my original idea with the issue templates, that we might be able to automate the creation of a PR from those. Unfortunately, people currently don't use them consistently enough for that, but I wouldn't discount that idea entirely. :)

@mjpost
Copy link
Member

mjpost commented Dec 11, 2024

I agree that we shouldn't create new temporary files in the repo. If we can't directly submit a PR, I'd suggest that we would create a queue outside the repo that could then be handled by a script. But I prefer to have the users themselves create a PR using their own Github ID.

@mjpost
Copy link
Member

mjpost commented Dec 11, 2024

I'm attaching a mockup of the interface (generated with the help of Claude) which adds author list manipulation. I'd prefer to have a button that would pop up a dialog containing this code, so everything would be self-contained. You can download and test this file.

image

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Paper Editor</title>
    <link href="https://cdnjs.cloudflare.com/ajax/libs/tailwindcss/2.2.19/tailwind.min.css" rel="stylesheet">
    <style>
        .draggable:hover {
            cursor: move;
            background-color: #f3f4f6;
        }
        .dragging {
            opacity: 0.5;
        }
    </style>
</head>
<body class="bg-gray-100 p-6">
    <div class="max-w-4xl mx-auto bg-white rounded-lg shadow p-6">
        <form id="paperForm" class="space-y-6">
            <div>
                <label class="block text-sm font-medium text-gray-700">Title</label>
                <input type="text" id="title" class="mt-1 block w-full rounded-md border-gray-300 shadow-sm focus:border-indigo-500 focus:ring-indigo-500" value="Exploring the Syntactic Abilities of RNNs with Multi-task Learning">
            </div>

            <div>
                <div class="flex justify-between items-center">
                    <label class="block text-sm font-medium text-gray-700">Authors</label>
                    <button type="button" id="addAuthor" class="px-4 py-2 text-sm font-medium text-gray-700 bg-white border border-gray-300 rounded-md hover:bg-gray-50">
                        Add Author
                    </button>
                </div>
                <div id="authorsList" class="mt-2 space-y-3">
                    <!-- Authors will be added here -->
                </div>
            </div>

            <div>
                <label class="block text-sm font-medium text-gray-700">Abstract</label>
                <textarea id="abstract" rows="6" class="mt-1 block w-full rounded-md border-gray-300 shadow-sm focus:border-indigo-500 focus:ring-indigo-500">Recent work has explored the syntactic abilities of RNNs using the subject-verb agreement task, which diagnoses sensitivity to sentence structure. RNNs performed this task well in common cases, but faltered in complex sentences (Linzen et al., 2016). We test whether these errors are due to inherent limitations of the architecture or to the relatively indirect supervision provided by most agreement dependencies in a corpus. We trained a single RNN to perform both the agreement task and an additional task, either CCG supertagging or language modeling. Multi-task training led to significantly lower error rates, in particular on complex sentences, suggesting that RNNs have the ability to evolve more sophisticated syntactic representations than shown before. We also show that easily available agreement training data can improve performance on other syntactic tasks, in particular when only a limited amount of training data is available for those tasks. The multi-task paradigm can also be leveraged to inject grammatical knowledge into language models.</textarea>
            </div>

            <input type="hidden" id="anthology_id" value="K17-1003">

            <div class="flex justify-end space-x-4">
                <button type="button" class="px-4 py-2 text-sm font-medium text-gray-700 bg-white border border-gray-300 rounded-md hover:bg-gray-50">
                    Cancel
                </button>
                <button type="submit" class="px-4 py-2 text-sm font-medium text-white bg-indigo-600 border border-transparent rounded-md hover:bg-indigo-700">
                    Submit
                </button>
            </div>
        </form>
    </div>

    <script>
        // Initial authors data
        const initialAuthors = [
            { first: "Émile", last: "Enguehard", affiliation: "" },
            { first: "Yoav", last: "Goldberg", affiliation: "" },
            { first: "Tal", last: "Linzen", affiliation: "" }
        ];

        let authors = [...initialAuthors];
        let draggedItem = null;

        function createAuthorRow(author, index) {
            const div = document.createElement('div');
            div.className = 'draggable flex gap-4 items-center p-4 bg-gray-50 rounded-lg';
            div.draggable = true;
            div.innerHTML = `
                <input type="text" placeholder="First Name" class="flex-1 px-3 py-2 rounded-md border-gray-300" value="${author.first}">
                <input type="text" placeholder="Last Name" class="flex-1 px-3 py-2 rounded-md border-gray-300" value="${author.last}">
                <input type="text" placeholder="Affiliation" class="flex-1 px-3 py-2 rounded-md border-gray-300" value="${author.affiliation}">
                <button type="button" class="remove-author px-2 py-1 text-red-600 hover:text-red-800">×</button>
            `;

            // Add drag and drop event listeners
            div.addEventListener('dragstart', (e) => {
                draggedItem = index;
                div.classList.add('dragging');
            });

            div.addEventListener('dragend', () => {
                div.classList.remove('dragging');
            });

            div.addEventListener('dragover', (e) => {
                e.preventDefault();
            });

            div.addEventListener('drop', (e) => {
                e.preventDefault();
                const dropIndex = index;
                if (draggedItem !== dropIndex) {
                    const temp = authors[draggedItem];
                    authors.splice(draggedItem, 1);
                    authors.splice(dropIndex, 0, temp);
                    renderAuthors();
                }
            });

            // Add remove button listener
            div.querySelector('.remove-author').addEventListener('click', () => {
                authors.splice(index, 1);
                renderAuthors();
            });

            // Add input change listeners
            const inputs = div.querySelectorAll('input');
            inputs[0].addEventListener('change', (e) => authors[index].first = e.target.value);
            inputs[1].addEventListener('change', (e) => authors[index].last = e.target.value);
            inputs[2].addEventListener('change', (e) => authors[index].affiliation = e.target.value);

            return div;
        }

        function renderAuthors() {
            const authorsList = document.getElementById('authorsList');
            authorsList.innerHTML = '';
            authors.forEach((author, index) => {
                authorsList.appendChild(createAuthorRow(author, index));
            });
        }

        // Initialize authors list
        renderAuthors();

        // Add author button
        document.getElementById('addAuthor').addEventListener('click', () => {
            authors.push({ first: '', last: '', affiliation: '' });
            renderAuthors();
        });

        // Form submission
        document.getElementById('paperForm').addEventListener('submit', (e) => {
            e.preventDefault();
            const params = new URLSearchParams();
            
            params.append('title', document.getElementById('title').value);
            params.append('abstract', document.getElementById('abstract').value);
            
            authors.forEach((author, index) => {
                params.append(`author${index + 1}_first`, author.first);
                params.append(`author${index + 1}_last`, author.last);
                params.append(`author${index + 1}_affiliation`, author.affiliation);
            });
            
            params.append('anthology_id', document.getElementById('anthology_id').value);

            const url = `https://aclanthology.org/edit?${params.toString()}`;
            alert(url);
        });
    </script>
</body>
</html>

@mjpost
Copy link
Member

mjpost commented Dec 11, 2024

@mbollmann I've only looked into this briefly, but I am starting to think that we won't be able to automate the creation of PRs.

With the issue templates, what strikes me is how often people fail to use them appropriate, for example, failing to substitute what are to me at least clearly variable templates.

What you wrote here suggests to me another workflow:

To add: This was kinda my original idea with the issue templates, that we might be able to automate the creation of a PR from those. Unfortunately, people currently don't use them consistently enough for that, but I wouldn't discount that idea entirely. :)

Based on this PR, we :

  • Add a link from each paper page that displays the dialog and allows editing.
  • Submitting this creates a Github issue under their account. Since we control the input, there is less risk that variables won't be filled. And since it's tied to a user account, it reduces the risk of abuse
  • We then write code that automatically parses these issues and commits the changes in batches.

I'm thinking that this workflow might actually work. It would also have additional advantages: we could batch the PRs, say on a weekly basis, instead of having to manage (approve and merge) each PR one-by-one. We could also manage a lot of the complexity of tag-handling and IDs in Python code on the server.

Our issues board is already swamped with metadata change requests. This could really help with this and would complete the work you started earlier.

(Apologies if this is what you had in mind, this is only now crystalizing for me).

image

@mbollmann
Copy link
Member

Based on this PR, we :

  • Add a link from each paper page that displays the dialog and allows editing.

  • Submitting this creates a Github issue under their account. Since we control the input, there is less risk that variables won't be filled. And since it's tied to a user account, it reduces the risk of abuse

Yes, but then we're at a point where I'm not sure we even need to implement our own dialog at all. We could just embed a link that opens a new issue pre-filled with the current title/author list etc., like so:

https://github.com/acl-org/acl-anthology/issues/new?template=01-metadata-correction.yml&anthology_id=K17-1003&paper_title=Exploring%20the%20Syntactic%20Abilities%20of%20RNNs%20with%20Multi-task%20Learning&author_name=Émile%20Enguehard,%20Yoav%20Goldberg,%20Tal%20Linzen

Since we can't enforce that people don't change anything in the issue text, even if we generate it from a dialog on our website, it might be a better approach to validate that with a Github Action or script. It could even automatically post on these issues if there's a problem that people should fix...

@mjpost
Copy link
Member

mjpost commented Dec 12, 2024

Okay, this is great, it seems we are finally homing in on a solution to this problem that has long dogged us. We may yet achieve this in Q4!

Directly linking to prefilled templates like this is a great idea. My main concern is that I don't trust users to make correct edits to those templates while maintaining machine-readability. The state of the current board is a good example of this (pictured above), where people interpret fields differently. We deal with lots of problems with author name formatting, and so on.

It seems better to me to use a modern web UI dialog where we present only the most basic fields that capture 90%+ of use cases and let the user make the changes that way. This avoids the manual error prone in unconstrained manipulation of text fields. This would also allow us to handle, e.g., author ID tags, which would just be hidden from the user (or could be a drop-down list), and any other arbitrary field

We would still instantiate the template, but it would be marked all over with language mentioning that the fields are meant to be read-only to preserve machine-readability. We couldn't stop users from changing them, of course, but likely at this point, only those who knew what they were doing would. In fact, the template could just be a single YAML block containing all the updated information:

==== YAML BLOCK ====
==== PLEASE DO NOT EDIT UNLESS YOU KNOW WHAT YOU'RE DOING ====
id: K17-1003
title: A new era in metadata editing for the <fixed-case>ACL</fixed-case> <fixed-case>A</fixed-case>anthology
abstract: ...
authors:
- name: Post, Matt
   affiliation: Microsoft
- name: Liu, Yang
   id: yang-liu-edinburgh
==== END YAML BLOCK ===

We could then have a weekly script that gathers these issues and creates a batch PR which would require human approval.

@mbollmann
Copy link
Member

My main concern is that I don't trust users to make correct edits to those templates while maintaining machine-readability.

Oh, absolutely, but I wouldn't trust them not to add irrelevant comments in a web UI that we create either. :)

I think we have to expect and handle these cases in the downstream process either way.

@mbollmann
Copy link
Member

We could then have a weekly script that gathers these issues and creates a batch PR which would require human approval.

What if we only want to approve some of these changes though?

@mjpost
Copy link
Member

mjpost commented Dec 12, 2024

What if we only want to approve some of these changes though?

The script could require some sign of approval from an Anthology team member, e.g., the addition of a label. @nschneid often performs this approval service on PRs, for example, checking that metadata corrections fit the PDF (I'd also love to add an LLM bot that does the same). It would be good to start conservatively.

@nschneid
Copy link
Contributor

Would be happy to be replaced by a bot. :)

@mbollmann
Copy link
Member

FWIW, I was playing around once with making a Github Action that posts a direct link to the paper based on the Anthology ID mentioned in the issue, but I seemed a bit tricky to do. Maybe if you write a proper script it would be more feasible. A bot posting an image of the first page of the PDF would probably be even more helpful :)

@mjpost
Copy link
Member

mjpost commented Dec 12, 2024

That's a nice idea that would simplify the checking, even if we still rely on @nschneid to do it.

A simple thing to do is just grab the thumbnail, e.g., https://aclanthology.org/thumb/K17-1003.jpg

@mjpost
Copy link
Member

mjpost commented Dec 27, 2024

Sorry, to be more clear, I meant whether there is an id attribute on a particular paper, e.g.,

<author id="fei-liu-gga"><first>Fei</first><last>Liu</last></author>

I have implemented the following solution:

  • In the paper page creation, every paper has an ID, since these are dumped from the YAML file
  • These IDs are passed to the Github issue. If new authors are added in the form, they get an empty author ID
  • When bulk processing metadata, we take each ID, and when editing the XML with the corrections, we search for that ID. If it's present in the XML, we retain it. This is only true for a very small subset, like the above example

In the future, it sounds like I can use NameSpecification when switching to your library.

However, I am hoping that we can merge this PR, and port to the new library in Q1 next year, perhaps under the discussed "sink-or-swim" scenario.

Note that I also updated the bulk processing script to close old-style metadata correction issues with a note, asking authors to resubmit their issue using the new system. You can see an example on #4208. It's very nice and I hope will really clear out our huge number of issues.

Copy link
Member

@mbollmann mbollmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to read the bulk processing script now primarily from a security perspective, and I see one issue, which is that we cannot rely on the approval by a team member to determine that the JSON is "safe"—or even that the submitted information is correct!—as a malicious actor could edit the JSON block after approval from a team member. This seems like a security vulnerability that’s not immediately clear to me how to avoid.

K17-1003.html Outdated Show resolved Hide resolved
bin/process_bulk_metadata.py Outdated Show resolved Hide resolved
bin/process_bulk_metadata.py Outdated Show resolved Hide resolved
bin/process_bulk_metadata.py Outdated Show resolved Hide resolved
bin/process_bulk_metadata.py Show resolved Hide resolved
Comment on lines +156 to +160
for author in changes["authors"]:
attrib = {}
if "id" in real_ids:
# if the ID was explicitly represented, preserve it
attrib["id"] = author["id"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only checks if the ID was associated with some author, but it may not be the one that is meant here. The scenario I have in mind is the following:

Let's say the original author list is

[
    {id: 'john-doe', first: 'John', last: 'Doe'},
    {first: 'Mary', last: 'Jane'}
]

If someone now uses the web interface to edit the first author to "Mary Jane", then deletes the second one, wouldn't the submitted JSON block look like

[
    {id: 'john-doe', first: 'Mary', last: 'Jane'}
]

Maybe this seems unlikely in this simple case, but I can easily see this happening when editing a list of 15+ authors.

But again maybe just something we need to be mindful of before approving?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is a possible situation. But my goal with this PR was to get something working and then refine it. Less than 0.5% of the <author> lines in the Anthology XML (1,908 / 354,582) have an id attribute. I'm not sure this is worth worrying about at this point; partly I was planning to revisit this next year when we figure out how to do explicit representation of people, rather than inferring people from names.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'm not saying this is a blocker. I would expect an explicit representation of people to mean that essentially everyone has an ID, but of course we can wait until we have figured that out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't thought it through completely, but that's loosely what I have in mind. And then, separately, there will be an in-repo person database indexed by those IDs. Ingestion would then have to create new person entries, and we'd have to occasionally clean them out if any were orphaned.

bin/process_bulk_metadata.py Show resolved Hide resolved
bin/create_hugo_pages.py Outdated Show resolved Hide resolved
@mjpost
Copy link
Member

mjpost commented Dec 27, 2024

Thanks for the security-minded review. A few points:

  • It's definitely worth considering security and potential abuse from the outset
  • That said, the final PR is still subject to manual merging and review, which mitigates the security threat somewhat
  • Still, it seems we could setup a Github action that clears approval status if an issue is edited. Let me put one together and push it up

@mjpost mjpost requested review from mbollmann and Copilot December 27, 2024 14:57
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 9 out of 13 changed files in this pull request and generated 1 comment.

Files not reviewed (4)
  • .github/ISSUE_TEMPLATE/01-metadata-correction.yml: Language not supported
  • hugo/assets/css/_papers.scss: Language not supported
  • hugo/layouts/_default/baseof.html: Language not supported
  • hugo/layouts/papers/single.html: Language not supported
Comments suppressed due to low confidence (2)

.github/ISSUE_TEMPLATE/99-bulk-metadata-correction.yml:10

  • The phrase 'activated by following the 'Fix metadata' link from each paper page in the Anthology' should be 'activated by following the 'Fix metadata' button from each paper page in the Anthology' for consistency.
activated by following the 'Fix metadata' link from each paper page in the Anthology

.github/workflows/annotate-metadata-issue.yml:58

  • [nitpick] The removal of the paper title in the comment reduces clarity. Consider including the paper title for better user experience.
Found ACL Anthology entry: https://aclanthology.org/${anthology_id}

validations:
required: true
- type: markdown
attributes:
value: |
**Note:** If you request an author name correction for yourself, please help ensure your name is correct for future publications by setting it correctly in Softconf/OpenReview.
**Important:** If you request an author name correction for yourself, please help ensure your name is correct for future publications by setting it correctly in Softconf/OpenReview.
Copy link
Preview

Copilot AI Dec 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The phrase 'If you request an author name correction for yourself, please help ensure your name is correct for future publications by setting it correctly in Softconf/OpenReview.' should be 'If you request an author name correction for yourself, please ensure your name is correct for future publications by setting it correctly in Softconf/OpenReview.' for clarity.

Suggested change
**Important:** If you request an author name correction for yourself, please help ensure your name is correct for future publications by setting it correctly in Softconf/OpenReview.
**Important:** If you request an author name correction for yourself, please ensure your name is correct for future publications by setting it correctly in Softconf/OpenReview.

Copilot is powered by AI, so mistakes are possible. Review output carefully before use.

Positive Feedback
Negative Feedback

Provide additional feedback

Please help us improve GitHub Copilot by sharing more details about this comment.

Please select one or more of the options
bin/process_bulk_metadata.py Outdated Show resolved Hide resolved
@nschneid nschneid dismissed their stale review December 27, 2024 15:09

addressed in 50fe7f9

@mbollmann
Copy link
Member

  • Still, it seems we could setup a Github action that clears approval status if an issue is edited. Let me put one together and push it up

Oh, how cool that this appears to be quite easy to do!

Comment on lines 248 to 250
collection_id, _, _ = deconstruct_anthology_id(anthology_id)
xml_path = f"data/xml/{collection_id}.xml"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I usually do is

scriptdir = os.path.dirname(os.path.abspath(__file__))
xml_path = f"{scriptdir}/../data/xml/{collection_id}.xml"

so that this code is not sensitive to your current working directory that you're calling the script from. I wonder if one "security check" then could be

assert os.path.realpath(xml_path) == os.path.realpath(f"{scriptdir}/../data/xml/")

or something to that effect?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually do something similar, but I assumed that this was unnecessary since the file is being open by self.repo.get_contents, and would therefore be relative to the repo root. But I tried it from a different directory and it fails, so I'll add this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like I had to create both: one path relative to the repo root (for reading from a branch with the GitHub module), and another absolute path (for writing to the actual XML file). I tested this by running in both bin/ and root.

@mbollmann
Copy link
Member

We should probably coordinate between this and #4192. Should we merge #4192 first, then merge master into autopr to run this against the new build?

@mjpost mjpost merged commit 8149399 into master Dec 27, 2024
2 checks passed
@mjpost mjpost deleted the autopr branch December 27, 2024 20:51
@mjpost
Copy link
Member

mjpost commented Dec 27, 2024

Sorry, @mbollmann, I merged this before seeing your note. I had the same thought, actually, but then counter-thought that aren't they mostly independent? Sorry if I complicated your merge.

@mbollmann
Copy link
Member

I’m not afraid of merge conflicts 😆 but once I merge #4192, let's double-check that the new metadata fixing functionality is still working.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants