-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AutoPR for paper information correction #4147
Conversation
This change contains the following features: On each paper page, a collapsible is added at the bottom, with the indication "File a correction". When clicked on, it shows a form prefilled current of the paper Users can modify these information and click submit After submission, the user is taken to a window that creates pull request to the repository. The effect of the pr is to add file containing these new information that the user entered. The path to the file and the branch that the pr is aiming can be specified.
This merge contains the following features: On each paper page, a collapsible is added at the bottom, with the indication "File a correction". When clicked on, it shows a form prefilled current of the paper Users can modify these information and click submit After submission, the user is taken to a window that creates pull request to the repository. The effect of the pr is to add file containing these new information that the user entered. The path to the file and the branch that the pr is aiming can be specified.
Thanks for starting this off! We can deal with the UI elements once we get the functionality working. I see that this creates a PR against a Markdown file. However, the PR needs instead to be created against the relevant portion of the the XML files. For example, if I edit K17-1003, it should create a PR against the file data/xml/K17.xml, for the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 1 out of 3 changed files in this pull request and generated no suggestions.
Files not reviewed (2)
- hugo/layouts/papers/single.html: Language not supported
- hugo/layouts/partials/correction.html: Language not supported
At the moment, it seems to be trying to create a Markdown file, e.g., as pictured. But this Markdown file is not part of the repository; instead, it is an artifact of the build process, created by running The trick here is whether a pull request can be automated. If we can find a way to (a) grab the current relevant XML file (b) modify it according to the parameters and (c) create a PR with the new file, that would be sufficient. I suggest we take these one by one. |
Yes it makes sense. However, I have not yet found a solution to locate and edit a sepecific section of a file in this way. My current vision is that we can put all such .md files in a specific directory, and use a python script to parse the .xml files and execute the change regularly. |
Yes, that makes sense, this could be a workaround for that (though note that the information is not all passed on correctly at the moment). Can you summarize the research you've done about automating PRs? What is possible in this area? For example, we know that we can do the following in Javascript:
Would it be possible, then to create a PR, replacing the existing file with the new one? |
Build successful. Some useful links:
This preview will be removed when the branch is merged. |
|
<label for="paper_author">Paper authors: </label> | ||
{{ range $index, $element := .paper.author }} | ||
{{ if $index }} | ||
<label for="paper_author_first">First Name: </label> | ||
<input type="text" id="paper_author_first_{{$index}}" name="paper_author_first_{{$index}}" value="{{ $element.first }}"><br><br> | ||
|
||
<label for="paper_author_last">Last Name: </label> | ||
<input type="text" id="paper_author_last_{{$index}}" name="paper_author_last_{{$index}}" value="{{ $element.last }}"><br><br> | ||
|
||
<!-- Is full name always first + last name? --> | ||
<!-- <label for="paper_author_full">Full Name: </label> | ||
<input type="text" id="paper_author_full_{{$index}}" name="paper_author_full_{{$index}}" value="{{ $element.full }}"><br><br> --> | ||
{{ end }} | ||
{{ end }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two potential complications:
- What do we do with names that require disambiguation through author IDs? We probably don't want to expose the complexity of author IDs to end users...
- I understand that this is a prototype/proof-of-concept, but just a note that we’d ultimately need functionality to remove, add, and ideally reorder author fields as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ID could be a hidden field embedded into the web page. As for the aesthetics and design, I have a new mockup I'll post below.
Cool that you're looking into concrete ideas for this! Regarding the .md files: I would definitely not want to add temporary Markdown files to the repository, if this is the suggestion. It clutters the commit history and I’m not sure it provides any benefit over simply opening a new issue containing this information. After all, we could also run a Python script that interacts with the Github API to read and parse issues, and then modifies the XML accordingly. I also feel that requiring the user suggesting changes to authenticate is a good thing, as it's a potential hurdle for abuse of this system. We shouldn't discount the security implications of such automation. Finally, if we need to find the line in an XML referring to a specific paper, there's always the option to embed that information into the website at build time. (Although this may run the risk that the information is outdated by the time the user clicks on the correction button.) |
To add: This was kinda my original idea with the issue templates, that we might be able to automate the creation of a PR from those. Unfortunately, people currently don't use them consistently enough for that, but I wouldn't discount that idea entirely. :) |
I agree that we shouldn't create new temporary files in the repo. If we can't directly submit a PR, I'd suggest that we would create a queue outside the repo that could then be handled by a script. But I prefer to have the users themselves create a PR using their own Github ID. |
I'm attaching a mockup of the interface (generated with the help of Claude) which adds author list manipulation. I'd prefer to have a button that would pop up a dialog containing this code, so everything would be self-contained. You can download and test this file. <!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Paper Editor</title>
<link href="https://cdnjs.cloudflare.com/ajax/libs/tailwindcss/2.2.19/tailwind.min.css" rel="stylesheet">
<style>
.draggable:hover {
cursor: move;
background-color: #f3f4f6;
}
.dragging {
opacity: 0.5;
}
</style>
</head>
<body class="bg-gray-100 p-6">
<div class="max-w-4xl mx-auto bg-white rounded-lg shadow p-6">
<form id="paperForm" class="space-y-6">
<div>
<label class="block text-sm font-medium text-gray-700">Title</label>
<input type="text" id="title" class="mt-1 block w-full rounded-md border-gray-300 shadow-sm focus:border-indigo-500 focus:ring-indigo-500" value="Exploring the Syntactic Abilities of RNNs with Multi-task Learning">
</div>
<div>
<div class="flex justify-between items-center">
<label class="block text-sm font-medium text-gray-700">Authors</label>
<button type="button" id="addAuthor" class="px-4 py-2 text-sm font-medium text-gray-700 bg-white border border-gray-300 rounded-md hover:bg-gray-50">
Add Author
</button>
</div>
<div id="authorsList" class="mt-2 space-y-3">
<!-- Authors will be added here -->
</div>
</div>
<div>
<label class="block text-sm font-medium text-gray-700">Abstract</label>
<textarea id="abstract" rows="6" class="mt-1 block w-full rounded-md border-gray-300 shadow-sm focus:border-indigo-500 focus:ring-indigo-500">Recent work has explored the syntactic abilities of RNNs using the subject-verb agreement task, which diagnoses sensitivity to sentence structure. RNNs performed this task well in common cases, but faltered in complex sentences (Linzen et al., 2016). We test whether these errors are due to inherent limitations of the architecture or to the relatively indirect supervision provided by most agreement dependencies in a corpus. We trained a single RNN to perform both the agreement task and an additional task, either CCG supertagging or language modeling. Multi-task training led to significantly lower error rates, in particular on complex sentences, suggesting that RNNs have the ability to evolve more sophisticated syntactic representations than shown before. We also show that easily available agreement training data can improve performance on other syntactic tasks, in particular when only a limited amount of training data is available for those tasks. The multi-task paradigm can also be leveraged to inject grammatical knowledge into language models.</textarea>
</div>
<input type="hidden" id="anthology_id" value="K17-1003">
<div class="flex justify-end space-x-4">
<button type="button" class="px-4 py-2 text-sm font-medium text-gray-700 bg-white border border-gray-300 rounded-md hover:bg-gray-50">
Cancel
</button>
<button type="submit" class="px-4 py-2 text-sm font-medium text-white bg-indigo-600 border border-transparent rounded-md hover:bg-indigo-700">
Submit
</button>
</div>
</form>
</div>
<script>
// Initial authors data
const initialAuthors = [
{ first: "Émile", last: "Enguehard", affiliation: "" },
{ first: "Yoav", last: "Goldberg", affiliation: "" },
{ first: "Tal", last: "Linzen", affiliation: "" }
];
let authors = [...initialAuthors];
let draggedItem = null;
function createAuthorRow(author, index) {
const div = document.createElement('div');
div.className = 'draggable flex gap-4 items-center p-4 bg-gray-50 rounded-lg';
div.draggable = true;
div.innerHTML = `
<input type="text" placeholder="First Name" class="flex-1 px-3 py-2 rounded-md border-gray-300" value="${author.first}">
<input type="text" placeholder="Last Name" class="flex-1 px-3 py-2 rounded-md border-gray-300" value="${author.last}">
<input type="text" placeholder="Affiliation" class="flex-1 px-3 py-2 rounded-md border-gray-300" value="${author.affiliation}">
<button type="button" class="remove-author px-2 py-1 text-red-600 hover:text-red-800">×</button>
`;
// Add drag and drop event listeners
div.addEventListener('dragstart', (e) => {
draggedItem = index;
div.classList.add('dragging');
});
div.addEventListener('dragend', () => {
div.classList.remove('dragging');
});
div.addEventListener('dragover', (e) => {
e.preventDefault();
});
div.addEventListener('drop', (e) => {
e.preventDefault();
const dropIndex = index;
if (draggedItem !== dropIndex) {
const temp = authors[draggedItem];
authors.splice(draggedItem, 1);
authors.splice(dropIndex, 0, temp);
renderAuthors();
}
});
// Add remove button listener
div.querySelector('.remove-author').addEventListener('click', () => {
authors.splice(index, 1);
renderAuthors();
});
// Add input change listeners
const inputs = div.querySelectorAll('input');
inputs[0].addEventListener('change', (e) => authors[index].first = e.target.value);
inputs[1].addEventListener('change', (e) => authors[index].last = e.target.value);
inputs[2].addEventListener('change', (e) => authors[index].affiliation = e.target.value);
return div;
}
function renderAuthors() {
const authorsList = document.getElementById('authorsList');
authorsList.innerHTML = '';
authors.forEach((author, index) => {
authorsList.appendChild(createAuthorRow(author, index));
});
}
// Initialize authors list
renderAuthors();
// Add author button
document.getElementById('addAuthor').addEventListener('click', () => {
authors.push({ first: '', last: '', affiliation: '' });
renderAuthors();
});
// Form submission
document.getElementById('paperForm').addEventListener('submit', (e) => {
e.preventDefault();
const params = new URLSearchParams();
params.append('title', document.getElementById('title').value);
params.append('abstract', document.getElementById('abstract').value);
authors.forEach((author, index) => {
params.append(`author${index + 1}_first`, author.first);
params.append(`author${index + 1}_last`, author.last);
params.append(`author${index + 1}_affiliation`, author.affiliation);
});
params.append('anthology_id', document.getElementById('anthology_id').value);
const url = `https://aclanthology.org/edit?${params.toString()}`;
alert(url);
});
</script>
</body>
</html> |
@mbollmann I've only looked into this briefly, but I am starting to think that we won't be able to automate the creation of PRs. With the issue templates, what strikes me is how often people fail to use them appropriate, for example, failing to substitute what are to me at least clearly variable templates. What you wrote here suggests to me another workflow:
Based on this PR, we :
I'm thinking that this workflow might actually work. It would also have additional advantages: we could batch the PRs, say on a weekly basis, instead of having to manage (approve and merge) each PR one-by-one. We could also manage a lot of the complexity of tag-handling and IDs in Python code on the server. Our issues board is already swamped with metadata change requests. This could really help with this and would complete the work you started earlier. (Apologies if this is what you had in mind, this is only now crystalizing for me). |
Yes, but then we're at a point where I'm not sure we even need to implement our own dialog at all. We could just embed a link that opens a new issue pre-filled with the current title/author list etc., like so: Since we can't enforce that people don't change anything in the issue text, even if we generate it from a dialog on our website, it might be a better approach to validate that with a Github Action or script. It could even automatically post on these issues if there's a problem that people should fix... |
Okay, this is great, it seems we are finally homing in on a solution to this problem that has long dogged us. We may yet achieve this in Q4! Directly linking to prefilled templates like this is a great idea. My main concern is that I don't trust users to make correct edits to those templates while maintaining machine-readability. The state of the current board is a good example of this (pictured above), where people interpret fields differently. We deal with lots of problems with author name formatting, and so on. It seems better to me to use a modern web UI dialog where we present only the most basic fields that capture 90%+ of use cases and let the user make the changes that way. This avoids the manual error prone in unconstrained manipulation of text fields. This would also allow us to handle, e.g., author ID tags, which would just be hidden from the user (or could be a drop-down list), and any other arbitrary field We would still instantiate the template, but it would be marked all over with language mentioning that the fields are meant to be read-only to preserve machine-readability. We couldn't stop users from changing them, of course, but likely at this point, only those who knew what they were doing would. In fact, the template could just be a single YAML block containing all the updated information:
We could then have a weekly script that gathers these issues and creates a batch PR which would require human approval. |
Oh, absolutely, but I wouldn't trust them not to add irrelevant comments in a web UI that we create either. :) I think we have to expect and handle these cases in the downstream process either way. |
What if we only want to approve some of these changes though? |
The script could require some sign of approval from an Anthology team member, e.g., the addition of a label. @nschneid often performs this approval service on PRs, for example, checking that metadata corrections fit the PDF (I'd also love to add an LLM bot that does the same). It would be good to start conservatively. |
Would be happy to be replaced by a bot. :) |
FWIW, I was playing around once with making a Github Action that posts a direct link to the paper based on the Anthology ID mentioned in the issue, but I seemed a bit tricky to do. Maybe if you write a proper script it would be more feasible. A bot posting an image of the first page of the PDF would probably be even more helpful :) |
That's a nice idea that would simplify the checking, even if we still rely on @nschneid to do it. A simple thing to do is just grab the thumbnail, e.g., https://aclanthology.org/thumb/K17-1003.jpg |
Sorry, to be more clear, I meant whether there is an <author id="fei-liu-gga"><first>Fei</first><last>Liu</last></author> I have implemented the following solution:
In the future, it sounds like I can use However, I am hoping that we can merge this PR, and port to the new library in Q1 next year, perhaps under the discussed "sink-or-swim" scenario. Note that I also updated the bulk processing script to close old-style metadata correction issues with a note, asking authors to resubmit their issue using the new system. You can see an example on #4208. It's very nice and I hope will really clear out our huge number of issues. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to read the bulk processing script now primarily from a security perspective, and I see one issue, which is that we cannot rely on the approval by a team member to determine that the JSON is "safe"—or even that the submitted information is correct!—as a malicious actor could edit the JSON block after approval from a team member. This seems like a security vulnerability that’s not immediately clear to me how to avoid.
for author in changes["authors"]: | ||
attrib = {} | ||
if "id" in real_ids: | ||
# if the ID was explicitly represented, preserve it | ||
attrib["id"] = author["id"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only checks if the ID was associated with some author, but it may not be the one that is meant here. The scenario I have in mind is the following:
Let's say the original author list is
[
{id: 'john-doe', first: 'John', last: 'Doe'},
{first: 'Mary', last: 'Jane'}
]
If someone now uses the web interface to edit the first author to "Mary Jane", then deletes the second one, wouldn't the submitted JSON block look like
[
{id: 'john-doe', first: 'Mary', last: 'Jane'}
]
Maybe this seems unlikely in this simple case, but I can easily see this happening when editing a list of 15+ authors.
But again maybe just something we need to be mindful of before approving?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is a possible situation. But my goal with this PR was to get something working and then refine it. Less than 0.5% of the <author>
lines in the Anthology XML (1,908 / 354,582) have an id attribute. I'm not sure this is worth worrying about at this point; partly I was planning to revisit this next year when we figure out how to do explicit representation of people, rather than inferring people from names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'm not saying this is a blocker. I would expect an explicit representation of people to mean that essentially everyone has an ID, but of course we can wait until we have figured that out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't thought it through completely, but that's loosely what I have in mind. And then, separately, there will be an in-repo person database indexed by those IDs. Ingestion would then have to create new person entries, and we'd have to occasionally clean them out if any were orphaned.
Thanks for the security-minded review. A few points:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 9 out of 13 changed files in this pull request and generated 1 comment.
Files not reviewed (4)
- .github/ISSUE_TEMPLATE/01-metadata-correction.yml: Language not supported
- hugo/assets/css/_papers.scss: Language not supported
- hugo/layouts/_default/baseof.html: Language not supported
- hugo/layouts/papers/single.html: Language not supported
Comments suppressed due to low confidence (2)
.github/ISSUE_TEMPLATE/99-bulk-metadata-correction.yml:10
- The phrase 'activated by following the 'Fix metadata' link from each paper page in the Anthology' should be 'activated by following the 'Fix metadata' button from each paper page in the Anthology' for consistency.
activated by following the 'Fix metadata' link from each paper page in the Anthology
.github/workflows/annotate-metadata-issue.yml:58
- [nitpick] The removal of the paper title in the comment reduces clarity. Consider including the paper title for better user experience.
Found ACL Anthology entry: https://aclanthology.org/${anthology_id}
validations: | ||
required: true | ||
- type: markdown | ||
attributes: | ||
value: | | ||
**Note:** If you request an author name correction for yourself, please help ensure your name is correct for future publications by setting it correctly in Softconf/OpenReview. | ||
**Important:** If you request an author name correction for yourself, please help ensure your name is correct for future publications by setting it correctly in Softconf/OpenReview. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The phrase 'If you request an author name correction for yourself, please help ensure your name is correct for future publications by setting it correctly in Softconf/OpenReview.' should be 'If you request an author name correction for yourself, please ensure your name is correct for future publications by setting it correctly in Softconf/OpenReview.' for clarity.
**Important:** If you request an author name correction for yourself, please help ensure your name is correct for future publications by setting it correctly in Softconf/OpenReview. | |
**Important:** If you request an author name correction for yourself, please ensure your name is correct for future publications by setting it correctly in Softconf/OpenReview. |
Copilot is powered by AI, so mistakes are possible. Review output carefully before use.
Oh, how cool that this appears to be quite easy to do! |
bin/process_bulk_metadata.py
Outdated
collection_id, _, _ = deconstruct_anthology_id(anthology_id) | ||
xml_path = f"data/xml/{collection_id}.xml" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I usually do is
scriptdir = os.path.dirname(os.path.abspath(__file__))
xml_path = f"{scriptdir}/../data/xml/{collection_id}.xml"
so that this code is not sensitive to your current working directory that you're calling the script from. I wonder if one "security check" then could be
assert os.path.realpath(xml_path) == os.path.realpath(f"{scriptdir}/../data/xml/")
or something to that effect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I usually do something similar, but I assumed that this was unnecessary since the file is being open by self.repo.get_contents
, and would therefore be relative to the repo root. But I tried it from a different directory and it fails, so I'll add this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like I had to create both: one path relative to the repo root (for reading from a branch with the GitHub module), and another absolute path (for writing to the actual XML file). I tested this by running in both bin/
and root.
Sorry, @mbollmann, I merged this before seeing your note. I had the same thought, actually, but then counter-thought that aren't they mostly independent? Sorry if I complicated your merge. |
I’m not afraid of merge conflicts 😆 but once I merge #4192, let's double-check that the new metadata fixing functionality is still working. |
This PR implements a semi-automated bulk processing of metadata corrections. We hope that it will vastly simplify the metadata correction experience for both users and Anthology staff. It includes the following pieces:
process_bulk_metadata.py
, processes these work items, makes the corrections, bundles them into a branch, and creates a unified PRTODO items:
Figure out why the text on the yellow "Fix metadata" button is white, not blackClose old metadata change requests with a note to use the new format (Clean up metadata issues #4175)