Tools for Misfigured Urls #18

greebie · 2019-02-05T18:37:32Z

There is an unresolved issue when parsing URLs that bleed into the surrounding text (often because of rich-text features such as tables).

For example,

https://www.example.com/index.html.Beginning_of_following_paragraph

could be resolved by accepting only one period after the URL, except that

https://www.example.com/index.htmlBeginning_of_following_paragraph

would still not be resolved.

I think an easier solution might be to offer some optional cleaning functions for the dataframes that archivr produces, but there could be other ideas.
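
A rough sketch of what such an optional cleaning step could look like, written in Python purely for illustration and not as archivr code. It assumes the URL should end at a known file extension whenever that extension is followed directly by a capital letter, or by a period and a capital letter. The extension list and the name `trim_bled_url` are hypothetical.

```python
import re

# Hypothetical list of extensions to treat as the end of a URL (an assumption,
# not something archivr defines).
KNOWN_EXTENSIONS = ("html", "htm", "php", "pdf", "xml")

# A known extension followed by an optional period and then a capital letter,
# which is taken as a sign that the next paragraph has bled into the URL.
_BLEED = re.compile(r"(\.(?:%s))(?=\.?[A-Z])" % "|".join(KNOWN_EXTENSIONS))


def trim_bled_url(url):
    """Cut the URL after the first known extension that runs into following text."""
    match = _BLEED.search(url)
    if match is None:
        # Nothing looks like bleed-through, so leave the URL untouched.
        return url
    return url[: match.end(1)]


if __name__ == "__main__":
    for candidate in (
        "https://www.example.com/index.html.Beginning_of_following_paragraph",
        "https://www.example.com/index.htmlBeginning_of_following_paragraph",
        "https://www.example.com/index.html",
    ):
        print(trim_bled_url(candidate))
```

This only covers URLs that end in a listed extension and text that starts with a capital letter, so it would have to stay optional. A legitimate path such as `page.html.Zip` would be cut short.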

greebie · 2019-10-17T14:51:36Z

Given the discussion in #28, we should include operations for reading XML and HTML as well.

adam3smith · 2019-10-17T15:03:44Z

I'll work on HTML and XML parsing.

adam3smith added the enhancement label Feb 5, 2019

greebie mentioned this issue Oct 17, 2019

Use readtext for other extensions #28

Merged

greebie mentioned this issue Oct 17, 2019

More Elegant Exit when result returns NULL #29

Closed

adam3smith mentioned this issue Oct 17, 2019

Separately parse xml and html documents #30

Closed
