Old style paper ids don't work #24

Open
cjcopi opened this issue Jun 15, 2015 · 7 comments
Comments

@cjcopi
Member

cjcopi commented Jun 15, 2015

Old style paper ids such as gr-qc/0103044v6 do not seem to be parsed correctly and are not searchable.

This shows up in the listing of June 15, 2015 for the update of the old paper "The Meaning of Einstein's Equation". PDF and Article links are not provided in the paper listing. This appears to be because cron.php, where the article id is parsed from the title, assumes that ids will be of the form 'arxiv:ddddd'. It may be better to use the 'link' tag or the rdf:about attribute of the item tag. Of course, you are also stripping the article id information from the title, so that will need to be done with more care as well.
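To illustrate the suggestion of reading the id from the link rather than the title, here is a minimal sketch in Python (the project itself uses PHP and JavaScript, so this is not the code in cron.php). The function name `parse_arxiv_link` and the pattern are hypothetical. The pattern assumes new-style ids look like `1502.06506` and old-style ids look like `gr-qc/0103044`, each with an optional `vN` suffix.

```python
import re

# Assumed id grammar, not taken from cron.php:
# new style 1502.06506v2, old style gr-qc/0103044v6 or math.GT/0309136.
ARXIV_ID = re.compile(
    r"""
    (?P<id>
        \d{4}\.\d{4,5}                  # new style: YYMM.NNNN or YYMM.NNNNN
      | [a-z-]+(?:\.[A-Z]{2})?/\d{7}    # old style: archive(.SC)/YYMMNNN
    )
    (?:v(?P<version>\d+))?              # optional version suffix
    $
    """,
    re.VERBOSE,
)


def parse_arxiv_link(url):
    """Return (id, version) taken from an abs URL, or None if nothing matches."""
    # Keep only the part after "/abs/" when it is present.
    path = url.split("/abs/", 1)[-1]
    match = ARXIV_ID.search(path)
    if match is None:
        return None
    version = match.group("version")
    return match.group("id"), int(version) if version else None
```

With this sketch, `parse_arxiv_link("http://arxiv.org/abs/gr-qc/0103044v6")` returns `("gr-qc/0103044", 6)`, and a link without a version suffix yields `None` as the version.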

@jbmertens
Member

Hm, interesting. The search actually isn't supported by arXiv, so I'm not sure we can easily fix that part of the issue.
E.g., http://arxiv.org/find/all/1/all:+0103044 or http://arxiv.org/find/all/1/all:+0103044v6

The arxiv "id" is indeed being extracted from the title here:
https://github.com/cwru-pat/coffee_stuff/blob/master/cron.php#L81
Unfortunately the ID isn't provided anywhere by arXiv, so it will need to be extracted from the title or URL (which will always be unreliable).
It may be easiest to modify the regex to accept old-style IDs. The culprit behind the PDF links not working is probably the same regex being used to display titles + links/buttons, e.g.:
https://github.com/cwru-pat/coffee_stuff/blob/220f4413df3e808b614e13c19c765d3c0fcf11c4/private/functions.php#L6

@cjcopi
Member Author

cjcopi commented Jun 15, 2015

The url is less reliable than the title? The arxiv api documentation says that the id can be found from the url by stripping the leading part
http://arxiv.org/help/api/user-manual#_entry_metadata

For the search you are not using the api?
http://export.arxiv.org/api/query?id_list=gr-qc/0103044v6
Of course this is specific to an id number. For a generic search you could use search_query but I'm not sure how to handle both cases in a single search box.

@jbmertens jbmertens added the bug label Jun 15, 2015
@jbmertens
Member

> The url is less reliable than the title? The arxiv api documentation says that the id can be found from the url by stripping the leading part
> http://arxiv.org/help/api/user-manual#_entry_metadata

I'd trust that not to change about as much as I'd trust the title not to change :) I don't think either is reliable. Though I'd be surprised if they have changed much in the past decade, so they may be reliable in that sense.

> For the search you are not using the api?
> http://export.arxiv.org/api/query?id_list=gr-qc/0103044v6
> Of course this is specific to an id number. For a generic search you could use search_query but I'm not sure how to handle both cases in a single search box.

The API is used, but only the search_query parameter is passed by our system (js/search.js#L92), so results are unfortunately empty for older IDs.
http://export.arxiv.org/api/query?search_query=gr-qc/0103044v6
This could be changed to try an ID search first, and then fall back to passing search_query when no paper is returned.
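The fallback described above could look roughly like the following Python sketch (not the code in js/search.js). The names `query_url`, `entries`, and `search` are hypothetical, and `fetch` is a caller-supplied function that takes a URL and returns the Atom response as a string, so the sketch makes no network calls of its own. Skipping entries titled "Error" is an assumption about how the API reports malformed ids and should be checked against the API manual.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

API_URL = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace


def query_url(param, text):
    """Build a query URL with a single parameter (id_list or search_query)."""
    return API_URL + "?" + urlencode({param: text})


def entries(atom_xml):
    """Return the feed's <entry> elements, skipping ones titled "Error"."""
    feed = ET.fromstring(atom_xml)
    return [
        entry
        for entry in feed.findall(ATOM + "entry")
        if entry.findtext(ATOM + "title") != "Error"  # assumed error marker
    ]


def search(text, fetch):
    """Try id_list first, then fall back to search_query if nothing is found."""
    for param in ("id_list", "search_query"):
        found = entries(fetch(query_url(param, text)))
        if found:
            return param, found
    return None, []
```

Returning the parameter that produced the hit lets the caller tell an exact id match from a general search result. Swapping the order of the tuple in `search` gives the opposite fallback order.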

@cjcopi
Member Author

cjcopi commented Jun 15, 2015

I would expect the url to be far, far more stable than the title. The url would be hard for them to change (it appears in many places) and at least appears in the api description as a method for getting the id. The title, on the other hand, is just text and I really don't know why the id appears in it at all. Of course they really should provide an id tag that covers this ....

I don't know how the search is used by people in practice. Writing a generic, powerful search would require more work and isn't justified unless there is a demand for it. However, being able to search for papers by id, including old ids, would be nice. That has been needed a few times during coffee (and I might use it more frequently if I knew it worked). So it would be nice if that could also be supported, though it is not essential.

@jbmertens
Member

> I would expect the url to be far, far more stable than the title. The url would be hard for them to change (it appears in many places) and at least appears in the api description as a method for getting the id. The title, on the other hand, is just text and I really don't know why the id appears in it at all. Of course they really should provide an id tag that covers this ....

Ok; I'm getting a bit confused between searching/importing via API and importing via RSS. I dug a bit and here is a summary of the behavior:

  • Searching is done by hitting arxiv's api with the request passed using the search_query parameter.
    • Oddly, arXiv will not find papers with old ids via this parameter, but will find new ones.
    • No actual importing is done here.
  • Importing via search (using the arXiv API) does process the XML, and attempts to use the id tag to determine the id and import:
    js/search.js#L39
    • Importing breaks for old-style-id papers, which have a differently formatted url/"id".
  • The RSS import has the link (or rdf:about) field, which isn't the same as the API url/id. The RSS link does not include version information, so the ID is extracted from the title:
    cron.php#L83
    • This is again different for older papers (I think), and so may break things.

I can also imagine this breaking if the RSS title format changes, but I can also imagine additional routes like http://arxiv.org/abs/1502.06506/v2, http://arxiv.org/astro-ph/1502.06506v2, etc. becoming available and preferable to http://arxiv.org/abs/1502.06506v2, and the RSS link field reflecting this. So things are a bit messy and liable to break if arXiv changes things; let's just hope they don't :)

And so I'd like to change:

  • Attempt to search by passing in the query as an id_list, then reverting to search_query if there are no results.
  • Figure out how old API ids should be mapped to fields we store, and do something more appropriate.
  • Figure out how old RSS titles should be mapped to fields we store, and do something more appropriate.

@deskinsjt
Member

Going down the rabbit hole: I agree that the RSS feed <link> tag doesn't include the version number, but is there a reason we care about the version? arXiv's behavior is that arxiv.org/pdf/arxivnumber and arxiv.org/abs/arxivnumber take you to the most recent version of the paper if no version is supplied. Further, if we store the <link> tag and use it to generate the PDF and arXiv links, we would just need to replace "abs" with "pdf". I think this would fix the links for imported old papers showing incorrectly. For papers that are searched and imported, the <id> tag plays the same role as the feed's <link> tag, so my previous comments apply with that change.
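A minimal Python sketch of the "abs" to "pdf" swap, assuming the stored link has the form http://arxiv.org/abs/<id>. The function name `pdf_link` is hypothetical. It swaps only the path segment rather than every occurrence of "abs", so an id that happened to contain those letters would be left alone.

```python
def pdf_link(abs_link):
    """Turn an abs URL into the matching pdf URL by swapping the path segment."""
    prefix, sep, paper = abs_link.partition("/abs/")
    if not sep:
        # The link does not contain "/abs/", so there is nothing to swap.
        raise ValueError("not an abs link: " + abs_link)
    return prefix + "/pdf/" + paper
```

For example, `pdf_link("http://arxiv.org/abs/gr-qc/0103044")` gives `"http://arxiv.org/pdf/gr-qc/0103044"`, which works the same way for old-style and new-style ids.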

@deskinsjt
Member

Also, what if we attempt to search with search_query first and then, if there are no results, try id_list?


3 participants