Tools for Misfigured Urls #18

Open
greebie opened this issue Feb 5, 2019 · 2 comments
Labels
enhancement New feature or request

Comments

@greebie
Collaborator

greebie commented Feb 5, 2019

There is an unresolved issue when parsing URLs that bleed into the surrounding text (often because of rich-text features such as tables).

For example,

https://www.example.com/index.html.Beginning_of_following_paragraph could be resolved by accepting only one period after the URL, except that

https://www.example.com/index.htmlBeginning_of_following_paragraph would still not be resolved.

I think an easier solution might be to offer some optional cleaning functions for the dataframes that archivr produces, but there could be other ideas.
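
A minimal sketch of what such an optional cleaning function could look like, applied to the two example URLs above. It is written in Python purely for illustration and is not part of archivr; the names `KNOWN_EXTENSIONS`, `trim_bleed`, and `clean_urls` are hypothetical, and the rule it applies (cut the URL after a known file extension when a capitalized word follows directly) is an assumed heuristic that can misfire on URLs that legitimately continue that way.

```python
import re

# Assumption: a short, hand-picked list of extensions that usually end a URL path.
KNOWN_EXTENSIONS = ("html", "htm", "php", "pdf")

# Match a URL up to the first known extension, followed by trailing text that
# starts with an optional period and an uppercase letter (the "bleed").
_BLEED = re.compile(
    r"^(?P<url>https?://\S+?\.(?:%s))(?P<rest>\.?[A-Z]\S*)$"
    % "|".join(KNOWN_EXTENSIONS)
)


def trim_bleed(url):
    """Return the URL without trailing paragraph text, or unchanged if none is detected."""
    match = _BLEED.match(url)
    return match.group("url") if match else url


def clean_urls(urls):
    """Apply trim_bleed to every URL in a list (e.g. one column of extracted links)."""
    return [trim_bleed(u) for u in urls]


if __name__ == "__main__":
    examples = [
        "https://www.example.com/index.html.Beginning_of_following_paragraph",
        "https://www.example.com/index.htmlBeginning_of_following_paragraph",
        "https://www.example.com/index.html",
    ]
    for cleaned in clean_urls(examples):
        print(cleaned)
```

Keeping this as a separate, opt-in step means the raw extraction result stays available for anyone whose URLs do not fit the heuristic.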

@adam3smith added the enhancement label Feb 5, 2019
@greebie
Collaborator Author

greebie commented Oct 17, 2019

Given the discussion in #28, we should include operations for reading XML and HTML as well.
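
As a rough illustration of why reading markup directly could help: when the source is HTML, link targets can be taken from `href` attributes, so they never touch the surrounding paragraph text. The sketch below uses only Python's standard-library `html.parser`; it is not archivr code, the names `LinkCollector` and `extract_links` are hypothetical, and the discussion in #28 is not reproduced here, so this is only one possible reading of what those operations might cover.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all anchor targets found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = (
        '<p>See <a href="https://www.example.com/index.html">the page</a>'
        "Beginning of following paragraph</p>"
    )
    print(extract_links(sample))
```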

@adam3smith
Contributor

I'll work on HTML and XML parsing.
