For this homework, you will get all of the well-formed URLs from a snippet of HTML.
Eventually, we want to be able to crawl the web and parse web pages. As such, we must be able to find the URLs (or links) embedded in the HTML code of the web pages. This assignment is a step towards this functionality. Specifically, you must return a list of HTTP or HTTPS URLs (or links) from the href attribute of the HTML anchor tag <a>. This should not include links in the href attribute of the <link> tag!
You have been provided with a method that will make sure URLs are in a semi-consistent properly-encoded form. You will need to figure out how to deal with relative vs absolute links, however. See below for more.
There are many helpful resources on the web to learn more about HTML. For this assignment, you will need to be familiar with the anchor tag <a>.
The anchor tag is used to create links on web pages. For example:
<a href="https://www.cs.usfca.edu/">USF CS</a>
The above code will generate the link USF CS, where the link text is USF CS and the link destination is https://www.cs.usfca.edu/. The link destination is always placed in the href attribute of the <a> tag, but not all <a> tags have this attribute. For example, this is a valid <a> tag without the href attribute:
<a name="home" class="bookmark">Home</a>
The href attribute may also appear in other tags. For example, this is a valid <link> tag that includes a style sheet:
<link rel="stylesheet" type="text/css" href="style.css">
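To make the difference between these three tags concrete, here is a minimal regular-expression sketch that collects href values from <a> tags only. The class name AnchorHrefSketch and the method findHrefs are hypothetical and not part of the template code. The pattern assumes double-quoted attribute values; real HTML may also use single quotes or no quotes, which this sketch does not handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are not from the template code.
class AnchorHrefSketch {
    // The lookahead after "<a" requires whitespace right after the tag name,
    // so tags such as <link> are skipped. Whitespace is also required directly
    // before "href", so attributes such as data-href are skipped.
    // Assumption: the href value is wrapped in double quotes.
    private static final Pattern ANCHOR_HREF = Pattern.compile(
            "<a(?=\\s)[^>]*?\\shref\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Returns the raw (possibly relative) href values of all <a> tags in the HTML.
    static List<String> findHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher matcher = ANCHOR_HREF.matcher(html);
        while (matcher.find()) {
            hrefs.add(matcher.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = String.join("\n",
                "<a href=\"https://www.cs.usfca.edu/\">USF CS</a>",
                "<a name=\"home\" class=\"bookmark\">Home</a>",
                "<link rel=\"stylesheet\" type=\"text/css\" href=\"style.css\">");

        // prints [https://www.cs.usfca.edu/]
        System.out.println(findHrefs(html));
    }
}
```

Only the first tag contributes a value: the second <a> tag has no href attribute, and the third tag is not an <a> tag at all.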
Many URLs on web pages are relative (i.e., specified relative to the URL of the current web page). You will need to convert those relative URLs into absolute URLs. For this, you may use the java.net.URL class. For example, consider the following:
URL base = new URL("http://www.cs.usfca.edu/~sjengle/cs212/");
URL absolute = new URL(base, "../index.html");
// outputs http://www.cs.usfca.edu/~sjengle/index.html
System.out.println(absolute);
This works even if the provided string was already absolute. For example:
URL base = new URL("http://www.cs.usfca.edu/~sjengle/cs212/");
URL absolute = new URL(base, "http://www.example.com/");
// outputs http://www.example.com/
System.out.println(absolute);
Because of this, you do not need to test whether a link is relative or absolute. You can always use the approach above.
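Putting the two cases together, the following is a rough sketch of resolving a batch of href values against one base URL and keeping only HTTP and HTTPS results. The class name ResolveLinksSketch and the method resolveAll are hypothetical, and the sketch does not call the provided cleaning method, whose name and signature are not shown here. Note that on Java 20 and later the java.net.URL constructors are marked deprecated, so this code compiles with a warning there.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the class and method names are not from the template code.
class ResolveLinksSketch {
    // Resolves each href against the base URL and keeps only http/https results.
    static List<String> resolveAll(String base, List<String> hrefs) {
        URL baseUrl;
        try {
            baseUrl = new URL(base);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Invalid base URL: " + base, e);
        }

        List<String> links = new ArrayList<>();
        for (String href : hrefs) {
            try {
                // Works for both relative and absolute href values.
                URL absolute = new URL(baseUrl, href);
                String protocol = absolute.getProtocol();
                if (protocol.equalsIgnoreCase("http") || protocol.equalsIgnoreCase("https")) {
                    links.add(absolute.toString());
                }
            } catch (MalformedURLException e) {
                // Skip values that cannot be parsed as a URL.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String base = "http://www.cs.usfca.edu/~sjengle/cs212/";
        List<String> hrefs = List.of(
                "../index.html",
                "http://www.example.com/",
                "mailto:someone@example.com");

        // prints [http://www.cs.usfca.edu/~sjengle/index.html, http://www.example.com/]
        System.out.println(resolveAll(base, hrefs));
    }
}
```

The mailto value is dropped because its protocol is neither http nor https, which matches the requirement to return only HTTP or HTTPS links.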
See the Javadoc and TODO comments in the template code in the src/main/java directory for additional details. You must pass the tests provided in the src/test/java directory. Do not modify any of the files in the src/test directory.
See the Homework Guides for additional details on homework requirements and submission.