diff --git a/.gitignore b/.gitignore index 31d0456a..de6dea63 100644 --- a/.gitignore +++ b/.gitignore @@ -159,8 +159,7 @@ cython_debug/ # option (not recommended) you can uncomment the following to ignore the entire idea folder. .idea/ -#pkl -*.pkl + #log log diff --git a/output_tmp/expected_output/61a-sp24-mt2_sol_2_pages/61a-sp24-mt2_sol_2_pages.md b/output_tmp/expected_output/61a-sp24-mt2_sol_2_pages/61a-sp24-mt2_sol_2_pages.md index 9f98ce6d..6c84bff1 100644 --- a/output_tmp/expected_output/61a-sp24-mt2_sol_2_pages/61a-sp24-mt2_sol_2_pages.md +++ b/output_tmp/expected_output/61a-sp24-mt2_sol_2_pages/61a-sp24-mt2_sol_2_pages.md @@ -1,4 +1,6 @@ -1. (7.0 points) What Would Python Display? Assume the following code has been executed. The Link class appears on the midterm 2 study guide (page 2, left side). def shake(it): if it is not Link.empty and it.rest is not Link.empty: if it.first + 1 < it.rest.first: it.rest = Link(it.rest.first-1, it.rest) shake(it) else: shake(it.rest) it = Link(2, Link(5, Link(7))) off = Link(1, it.rest) shake(it) def cruel(summer): while summer is not Link.empty: yield summer.first summer = summer.rest if summer is not Link.empty: summer = summer.rest summer = Link(1, Link(2, Link(3, Link(4)))) Write the output printed for each expression below or _Error_ if an error occurs. 1. (2.0 pt) print(it) <2 5 7> <2 4 5 7> <2 4 5 6 7> <2 3 4 5 7> <2 4 3 5 7> <2 3 4 5 6 7> <2 4 3 5 6 7> (2.0 pt) print(off) <1 5 6 7> (2.0 pt) print([x*x for x in cruel(summer)]) [1, 9] +1. (7.0 points) **What Would Python Display?** Assume the following code has been executed. The Link class appears on the midterm 2 study guide (page 2, left side). def shake(it): if it is not Link.empty and it.rest is not Link.empty: if it.first + 1 < it.rest.first: it.rest = Link(it.rest.first-1, it.rest) shake(it) else: shake(it.rest) it = Link(2, Link(5, Link(7))) off = Link(1, it.rest) shake(it) def cruel(summer): while summer is not Link.empty: yield summer.first summer = summer.rest if summer is not Link.empty: summer = summer.rest summer = Link(1, Link(2, Link(3, Link(4)))) Write the output printed for each expression below or _Error_ if an error occurs. 1. (2.0 pt) print(it) <2 5 7> <2 4 5 7> <2 4 5 6 7> <2 3 4 5 7> <2 4 3 5 7> <2 3 4 5 6 7> <2 4 3 5 6 7> (2.0 pt) print(off) <1 5 6 7> (2.0 pt) print([x*x for x in cruel(summer)]) + +[1, 9] **(d) (1.0 pt)** What is the order of growth of the time it takes to evaluate shake(Link(1, Link(n))) in terms of n? 
diff --git a/output_tmp/expected_output/61a-sp24-mt2_sol_2_pages/61a-sp24-mt2_sol_2_pages.pdf b/output_tmp/expected_output/61a-sp24-mt2_sol_2_pages/61a-sp24-mt2_sol_2_pages.pdf index 40bdb298..0990adf4 100644 Binary files a/output_tmp/expected_output/61a-sp24-mt2_sol_2_pages/61a-sp24-mt2_sol_2_pages.pdf and b/output_tmp/expected_output/61a-sp24-mt2_sol_2_pages/61a-sp24-mt2_sol_2_pages.pdf differ diff --git a/output_tmp/expected_output/61a-sp24-mt2_sol_2_pages/61a-sp24-mt2_sol_2_pages.pkl b/output_tmp/expected_output/61a-sp24-mt2_sol_2_pages/61a-sp24-mt2_sol_2_pages.pkl new file mode 100644 index 00000000..1bae5c50 Binary files /dev/null and b/output_tmp/expected_output/61a-sp24-mt2_sol_2_pages/61a-sp24-mt2_sol_2_pages.pkl differ diff --git a/output_tmp/expected_output/section-0-brief-python-refresher/section-0-brief-python-refresher.md b/output_tmp/expected_output/section-0-brief-python-refresher/section-0-brief-python-refresher.md new file mode 100644 index 00000000..0ca089cb --- /dev/null +++ b/output_tmp/expected_output/section-0-brief-python-refresher/section-0-brief-python-refresher.md @@ -0,0 +1,602 @@ +[![Open In Colab](../images/colab-badge.svg)](https://colab.research.google.com/github/MonashDataFluency/python-web-scraping/blob/master/notebooks/section-0-brief-python-refresher.ipynb) + +
*[xkcd comic (https://xkcd.com/353/) — image license: https://xkcd.com/license.html (CC BY-NC 2.5)]*
+ +### Jupyter-style notebooks on Google Colaboratory - A quick tour +--- + +1. Go to [this](https://colab.research.google.com) link and login with your Google account. +2. Select **NEW NOTEBOOK** - a new Python3 notebook will be created. +3. Type some Python code in the top cell, eg: + + +```python +print("Hello world!!") +``` + + Hello world!! + + +4. **Shift-Enter** to run the contents of the cell. + +In this section we will take a quick tour of some of the concepts that will be used for the rest of our web scraping workshop. + + +### Dataframes +--- + +One of the most powerful data structures in Python is the Pandas `DataFrame`. It allows tabular data, including `csv` (comma seperated values) and `tsv` (tab seperated values), to be processed and manipulated. + +People familiar with Excel will no doubt find it intuitive and easy to grasp. Since most `csv` (or `tsv`) has become the de facto standard for sharing datasets both large and small, Pandas dataframe is the way to go. + + +```python +import pandas as pd # importing the package and using `pd` as the alias +print('Pandas version : {}'.format(pd.__version__)) +``` + + Pandas version : 1.0.1 + + +Suppose we wanted to create a dataframe as follows, + +| name | title | +|------|-----------| +| sam | physicist | +| rob | economist | + +Let's create a dictionary with the headers as keys and their corresponding values as a list as follows, + + +```python +data = {'name': ['sam', 'rob'], 'title': ['physicist', 'economist']} +``` + +Converting the same to a dataframe, + + +```python +df = pd.DataFrame(data) +# Note: converting to markdown for ease of display on site +# print(df._to_markdown()) +df +``` + +| | name | title | +|---:|:-------|:----------| +| 0 | sam | physicist | +| 1 | rob | economist | + + +Now lets create a bigger dataframe and learn some useful functions that can be performed on them. 
+ + +```python +data = {'Name': ['Sam', 'Rob', 'Jack', 'Jill', 'Dave', 'Alex', 'Steve'],\ + 'Title': ['Physicist', 'Economist', 'Statistician', 'Data Scientist', 'Designer', 'Architect', 'Doctor'], \ + 'Age': [59, 66, 42, 28, 24, 39, 52],\ + 'City': ['Melbourne', 'Melbourne', 'Sydney', 'Sydney', 'Melbourne', 'Perth', 'Brisbane'],\ + 'University': ['Monash', 'Monash', 'UNSW', 'UTS', 'Uni Mel', 'UWA', 'UQ']} +``` + + +```python +df = pd.DataFrame(data) +df +``` + +| | Name | Title | Age | City | University | +|---:|:-------|:---------------|------:|:----------|:-------------| +| 0 | Sam | Physicist | 59 | Melbourne | Monash | +| 1 | Rob | Economist | 66 | Melbourne | Monash | +| 2 | Jack | Statistician | 42 | Sydney | UNSW | +| 3 | Jill | Data Scientist | 28 | Sydney | UTS | +| 4 | Dave | Designer | 24 | Melbourne | Uni Mel | +| 5 | Alex | Architect | 39 | Perth | UWA | +| 6 | Steve | Doctor | 52 | Brisbane | UQ | + + +We can also take a quick glance at its contents by using : + +- `df.head()` : To display the first 5 rows + +- `df.tail()` : To display the last 5 rows + + +```python +df.head() +``` + +| | Name | Title | Age | City | University | +|---:|:-------|:---------------|------:|:----------|:-------------| +| 0 | Sam | Physicist | 59 | Melbourne | Monash | +| 1 | Rob | Economist | 66 | Melbourne | Monash | +| 2 | Jack | Statistician | 42 | Sydney | UNSW | +| 3 | Jill | Data Scientist | 28 | Sydney | UTS | +| 4 | Dave | Designer | 24 | Melbourne | Uni Mel | + + +```python +df.tail() +``` + +| | Name | Title | Age | City | University | +|---:|:-------|:---------------|------:|:----------|:-------------| +| 2 | Jack | Statistician | 42 | Sydney | UNSW | +| 3 | Jill | Data Scientist | 28 | Sydney | UTS | +| 4 | Dave | Designer | 24 | Melbourne | Uni Mel | +| 5 | Alex | Architect | 39 | Perth | UWA | +| 6 | Steve | Doctor | 52 | Brisbane | UQ | + +Lets say we want to fiter out all the people from `Sydney`. + + +```python +df[df['City'] == 'Sydney'] +``` + +| | Name | Title | Age | City | University | +|---:|:-------|:---------------|------:|:-------|:-------------| +| 2 | Jack | Statistician | 42 | Sydney | UNSW | +| 3 | Jill | Data Scientist | 28 | Sydney | UTS | + + +Now, lets say we want to look at all the people in `Melbourne` and in `Monash`. Notice the usage of `()` and `&`. + + +```python +df[(df['City'] == 'Melbourne') & (df['University'] == 'Monash')] +``` + +| | Name | Title | Age | City | University | +|---:|:-------|:----------|------:|:----------|:-------------| +| 0 | Sam | Physicist | 59 | Melbourne | Monash | +| 1 | Rob | Economist | 66 | Melbourne | Monash | + + +#### Challenge +--- + +How can we filter people from `Melbourne` and above the Age of `50`? + +--- + +We can also fetch specific rows based on their indexes as well. + + +```python +df.iloc[1:3] +``` + +| | Name | Title | Age | City | University | +|---:|:-------|:-------------|------:|:----------|:-------------| +| 1 | Rob | Economist | 66 | Melbourne | Monash | +| 2 | Jack | Statistician | 42 | Sydney | UNSW | + + +Lets try changing Jack's age to 43, because today is his Birthday and he has now turned 43. + + +```python +df.loc[2, 'Age'] = 43 +``` + +The above is just one way to do this. Some of the other ways are as follows: + +- `df.at[2, 'Age'] = 43` + +- `df.loc[df[df['Name'] == 'Jack'].index, 'Age'] = 43` + +Lets look at the updated data frame. 
+ + +```python +df +``` + +| | Name | Title | Age | City | University | +|---:|:-------|:---------------|------:|:----------|:-------------| +| 0 | Sam | Physicist | 59 | Melbourne | Monash | +| 1 | Rob | Economist | 66 | Melbourne | Monash | +| 2 | Jack | Statistician | 43 | Sydney | UNSW | +| 3 | Jill | Data Scientist | 28 | Sydney | UTS | +| 4 | Dave | Designer | 24 | Melbourne | Uni Mel | +| 5 | Alex | Architect | 39 | Perth | UWA | +| 6 | Steve | Doctor | 52 | Brisbane | UQ | + +For exporting a Pandas dataframe to a `csv` file, we can use `to_csv()` as follows +```python +df.to_csv(filename, index=False) +``` + +Lets try writing our data frame to a file. + + +```python +df.to_csv('researchers.csv', index=False) +``` + +In order to read external files we use `read_csv()` function, +```python +pd.read_csv(filename, sep=',') +``` + +We can read back the file that we just created. + + +```python +df_res = pd.read_csv('researchers.csv', sep=',') +df_res +``` + +| | Name | Title | Age | City | University | +|---:|:-------|:---------------|------:|:----------|:-------------| +| 0 | Sam | Physicist | 59 | Melbourne | Monash | +| 1 | Rob | Economist | 66 | Melbourne | Monash | +| 2 | Jack | Statistician | 43 | Sydney | UNSW | +| 3 | Jill | Data Scientist | 28 | Sydney | UTS | +| 4 | Dave | Designer | 24 | Melbourne | Uni Mel | +| 5 | Alex | Architect | 39 | Perth | UWA | +| 6 | Steve | Doctor | 52 | Brisbane | UQ | + + +### JSON +--- + +JSON stands for *JavaScript Object Notation*. + +When exchanging data between a browser and a server, the data can only be text. + +Python has a built-in package called `json`, which can be used to work with JSON data. + + +```python +import json +``` + +Once we imported the library, now lets look at how we can obtain a JSON object from a string (or more accurately a JSON string). This process is also know as deserialization where we convert a string to an object. + +JSON is a string that can be turned into ('deserialized') a Python dictionary containing just primitive types (floats, strings, bools, lists, dicts and None). + + +```python +# Convert from JSON to Python: + +# some JSON: +x = '{ "name":"John", "age":30, "city":"New York"}' + +# parse x: +y = json.loads(x) + +# the result is a Python dictionary: +print(y["age"]) +``` + + 30 + + +Lets take a look at `y` as follows, + + +```python +print(y) +``` + + { + "name": "John", + "age": 30, + "city": "New York" + } + + +We can obtain the exact same JSON string we defined earlier from a Python dictionary as follows, + + +```python +# Convert from Python to JSON + +# a Python object (dict): +x = { + "name": "John", + "age": 30, + "city": "New York" +} + +# convert into JSON: +y = json.dumps(x) + +# the result is a JSON string: +print(y) +``` + + {"name": "John", "age": 30, "city": "New York"} + + +For better formatting we can indent the same as, + + +```python +# Indentation +y = json.dumps(x, indent=4) +print(y) + +``` + + { + "name": "John", + "age": 30, + "city": "New York" + } + + +### Regex +--- + +Regular expressions or regex are a powerful tool to extract key pieces of data from raw text. + +This is a whirlwind tour of regular expressions so you have some familiarity, but becoming proficient will require further study and practise. 
Regex is deep enough to have a whole workshop of its own.

You can try your regex expressions in :

- [pythex](https://pythex.org/) for a Python-oriented regex editor
- [regexr](https://regexr.com/) for a more visual explanation behind the expressions (good for getting started)


```python
import re # the regex package in Python is named 're'
```


```python
my_str = 'python123good'
re.search('123', my_str)
```

    <re.Match object; span=(6, 9), match='123'>


```python
if re.search('123', my_str):
    print("Found")
else:
    print("Not found")
```

    Found


We can use `[0-9]` in the regular expression to identify any one number in the string.


```python
my_str = 'python123good'
re.search('[0-9]', my_str)
```

    <re.Match object; span=(6, 7), match='1'>


Notice that it matches the first occurrence only.

Now, the above regex can be modified to match any three numbers in a string.


```python
my_str = 'python123good'
re.search('[0-9][0-9][0-9]', my_str)
```

    <re.Match object; span=(6, 9), match='123'>


```python
print(re.search('[0-9][0-9][0-9]','hello123'))          # matches 123
print(re.search('[0-9][0-9][0-9]','great678python'))    # matches 678
print(re.search('[0-9][0-9][0-9]','01234webscraping'))  # matches 012
print(re.search('[0-9][0-9][0-9]','01web5678scraping')) # matches 567
print(re.search('[0-9][0-9][0-9]','hello world'))       # matches nothing
```

    <re.Match object; span=(5, 8), match='123'>
    <re.Match object; span=(5, 8), match='678'>
    <re.Match object; span=(0, 3), match='012'>
    <re.Match object; span=(5, 8), match='567'>
    None


As seen above, it matches the first occurrence of three digits occurring together.

The above example can be extended to match any number of digits using the wildcard character `*`, which matches `zero or more repetitions`.


```python
print(re.search('[a-z]*[0-9]*','hello123@@')) # matches hello123
```

    <re.Match object; span=(0, 8), match='hello123'>


What if we just want to capture only the numbers? A `capture group` is the answer.


```python
num_regex = re.compile('[a-z]*([0-9]*)[a-z]*')
my_str = 'python123good'
num_regex.findall(my_str)
```

    ['123', '']


We see that it matches an empty string as well because `*` matches zero or more occurrences.

To avoid this, we can use `+`, which matches one or more occurrences.


```python
num_regex = re.compile('[a-z]*([0-9]+)[a-z]*')
my_str = 'python123good'
num_regex.findall(my_str)
```

    ['123']


We can use `^` and `$` to match at the beginning and end of the string.

As shown in the two examples below, we use `^` to capture the numbers only at the start of the string.


```python
num_regex = re.compile('^([0-9]+)[a-z]*')
my_str = '123good'
num_regex.findall(my_str)
```

    ['123']


```python
my_str = 'hello123good'
num_regex.findall(my_str)
```

    []


#### Challenge
---

What regular expression can be used to match the numbers **only** at the end of the string:

- `'[a-z]*([0-9]+)[a-z]+'`
- `'[a-z]*([0-9]+)$'`
- `'$[a-z]*([0-9]+)'`
- `'([0-9]+)'`

---

Now, having learnt regular expressions on basic strings, the same concept can be applied to HTML as well, as shown below:


```python
html = r'''
<!DOCTYPE html>
<html>
    <head>
        <title>site</title>
    </head>
    <body>
        <h1>Sam</h1>
        <h2>Physicist</h2>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod.</p>

        <h1>Rob</h1>
        <h2>Economist</h2>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et.</p>
    </body>
</html>
'''

print(html)
```

    <!DOCTYPE html>
    <html>
        <head>
            <title>site</title>
        </head>
        <body>
            <h1>Sam</h1>
            <h2>Physicist</h2>
            <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod.</p>

            <h1>Rob</h1>
            <h2>Economist</h2>
            <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et.</p>
        </body>
    </html>

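The extraction patterns defined next use the non-greedy qualifier `.*?`. As a quick illustrative aside (not in the original notebook, and using a made-up one-line string), this is what the `?` buys us when several tags sit on the same line:


```python
import re

# hypothetical one-line snippet, just to contrast greedy and non-greedy matching
line = '<h1>Sam</h1><h1>Rob</h1>'

print(re.findall('<h1>(.*)</h1>', line))   # greedy: ['Sam</h1><h1>Rob']
print(re.findall('<h1>(.*?)</h1>', line))  # non-greedy: ['Sam', 'Rob']
```

The non-greedy form stops at the first closing tag, which is what we want when pulling out each heading separately.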
Now, if we are only interested in :

- the names, i.e. the data inside the `<h1>` tags, and
- the titles, i.e. the data inside the `<h2>` tags,

we can extract them using regex.

Let's define the expressions (or patterns) to capture all text between the tags as follows :

- `<h1>(.*?)</h1>` : capture all text contained within `<h1>` tags
- `<h2>(.*?)</h2>` : capture all text contained within `<h2>` tags


```python
regex_h1 = re.compile('<h1>(.*?)</h1>')
regex_h2 = re.compile('<h2>(.*?)</h2>')
```

and use `findall()` to return all the instances that match our pattern,


```python
names = regex_h1.findall(html)
titles = regex_h2.findall(html)

print(names, titles)
```

    ['Sam', 'Rob'] ['Physicist', 'Economist']


### From a web scraping perspective
- `JSON` and `XML` are the most widely used formats to carry data all over the internet.
- To work with `CSV`s (or `TSV`s), Pandas DataFrames are the de facto standard.
- Regexes help us extract key pieces of information (sub-strings) from raw, messy and unstructured text (strings).

### References

- https://xkcd.com/353/
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
- https://realpython.com/regex-python/
\ No newline at end of file
diff --git a/output_tmp/expected_output/section-0-brief-python-refresher/section-0-brief-python-refresher.pkl b/output_tmp/expected_output/section-0-brief-python-refresher/section-0-brief-python-refresher.pkl new file mode 100644 index 00000000..99a2bbe2 Binary files /dev/null and b/output_tmp/expected_output/section-0-brief-python-refresher/section-0-brief-python-refresher.pkl differ
diff --git a/output_tmp/expected_output/section-3-API-based-scraping/section-3-API-based-scraping.md b/output_tmp/expected_output/section-3-API-based-scraping/section-3-API-based-scraping.md new file mode 100644 index 00000000..5a5c39c9 --- /dev/null +++ b/output_tmp/expected_output/section-3-API-based-scraping/section-3-API-based-scraping.md @@ -0,0 +1,726 @@
[![Open In Colab](../images/colab-badge.svg)](https://colab.research.google.com/github/MonashDataFluency/python-web-scraping/blob/master/notebooks/section-3-API-based-scraping.ipynb)


### A brief introduction to APIs
---

In this section, we will take a look at an alternative way of gathering data to the pattern-based HTML scraping of the previous section. Sometimes websites offer an API (or Application Programming Interface) as a service which provides a high-level interface to directly retrieve data from their repositories or databases at the backend.

From Wikipedia,

> "*An API is typically defined as a set of specifications, such as Hypertext Transfer Protocol (HTTP) request messages, along with a definition of the structure of response messages, usually in an Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format.*"

APIs typically tend to be URL endpoints (to be fired as requests) that need to be modified based on our requirements (what we desire in the response body), which then return a payload (data) within the response, formatted as either JSON, XML or HTML.

A popular web architecture style called `REST` (or representational state transfer) allows users to interact with web services via `GET` and `POST` calls (the two most commonly used), which we briefly saw in the previous section.

For example, Twitter's REST API allows developers to access core Twitter data and the Search API provides methods for developers to interact with Twitter Search and trends data.

There are primarily two ways to use APIs :

- Through the command terminal using URL endpoints, or
- Through programming language specific *wrappers*

For example, `Tweepy` is a well-known Python wrapper for the Twitter API whereas `twurl` is a command line interface (CLI) tool, but both can achieve the same outcomes.

Here we focus on the latter approach and will use a Python library (a wrapper) called `wptools`, which is based around the original MediaWiki API.
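To make the two approaches concrete, the cell below is a minimal sketch (not part of the original workshop code) of the first approach — calling the MediaWiki URL endpoint directly with the `requests` library. The specific query parameters (`action=query`, `prop=extracts`) are illustrative assumptions rather than anything this section depends on; the wrapper-based approach with `wptools` follows in the rest of the section.


```python
# A minimal sketch of the URL-endpoint approach, assuming the `requests`
# package is installed; the parameters below are illustrative only.
import requests

endpoint = 'https://en.wikipedia.org/w/api.php'  # MediaWiki API endpoint

params = {
    'action': 'query',      # standard query action
    'titles': 'Walmart',    # article we are interested in
    'prop': 'extracts',     # ask for an extract of the page
    'explaintext': 1,       # plain text rather than HTML
    'format': 'json',       # response format
}

response = requests.get(endpoint, params=params)  # fire the GET request
data = response.json()                            # parse the JSON payload

# pages in the response are keyed by their page id
for page_id, page in data['query']['pages'].items():
    print(page['title'])
    print(page.get('extract', '')[:200])  # first 200 characters of the extract
```

Wrappers such as `wptools` hide this request/response plumbing behind a few method calls, which is why we use one below.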
+ +One advantage of using official APIs is that they are usually compliant of the terms of service (ToS) of a particular service that researchers are looking to gather data from. However, third-party libraries or packages which claim to provide more throughput than the official APIs (rate limits, number of requests/sec) generally operate in a gray area as they tend to violate ToS. Always be sure to read their documentation throughly. + +### Wikipedia API +--- + +Let's say we want to gather some additional data about the Fortune 500 companies and since wikipedia is a rich source for data we decide to use the MediaWiki API to scrape this data. One very good place to start would be to look at the **infoboxes** (as wikipedia defines them) of articles corresponsing to each company on the list. They essentially contain a wealth of metadata about a particular entity the article belongs to which in our case is a company. + +For e.g. consider the wikipedia article for **Walmart** (https://en.wikipedia.org/wiki/Walmart) which includes the following infobox : + +![An infobox](../images/infobox.png) + +As we can see from above, the infoboxes could provide us with a lot of valuable information such as : + +- Year of founding +- Industry +- Founder(s) +- Products +- Services +- Operating income +- Net income +- Total assets +- Total equity +- Number of employees etc + +Although we expect this data to be fairly organized, it would require some post-processing which we will tackle in our next section. We pick a subset of our data and focus only on the top **20** of the Fortune 500 from the full list. + +Let's begin by installing some of libraries we will use for this excercise as follows, + + +```python +# sudo apt install libcurl4-openssl-dev libssl-dev +!pip install wptools +!pip install wikipedia +!pip install wordcloud +``` + +Importing the same, + + +```python +import json +import wptools +import wikipedia +import pandas as pd + +print('wptools version : {}'.format(wptools.__version__)) # checking the installed version +``` + + wptools version : 0.4.17 + + +Now let's load the data which we scrapped in the previous section as follows, + + +```python +# If you dont have the file, you can use the below code to fetch it: +import urllib.request +url = 'https://raw.githubusercontent.com/MonashDataFluency/python-web-scraping/master/data/fortune_500_companies.csv' +urllib.request.urlretrieve(url, 'fortune_500_companies.csv') +``` + + + + + ('fortune_500_companies.csv', ) + + + + +```python +fname = 'fortune_500_companies.csv' # scrapped data from previous section +df = pd.read_csv(fname) # reading the csv file as a pandas df +df.head() # displaying the first 5 rows +``` + +| | rank | company_name | company_website | +|---:|-------:|:-------------------|:---------------------------------| +| 0 | 1 | Walmart | http://www.stock.walmart.com | +| 1 | 2 | Exxon Mobil | http://www.exxonmobil.com | +| 2 | 3 | Berkshire Hathaway | http://www.berkshirehathaway.com | +| 3 | 4 | Apple | http://www.apple.com | +| 4 | 5 | UnitedHealth Group | http://www.unitedhealthgroup.com | + + +Let's focus and select only the top 20 companies from the list as follows, + + +```python +no_of_companies = 20 # no of companies we are interested +df_sub = df.iloc[:no_of_companies, :].copy() # only selecting the top 20 companies +companies = df_sub['company_name'].tolist() # converting the column to a list +``` + +Taking a brief look at the same, + + +```python +for i, j in enumerate(companies): # looping through the list of 20 company + 
print('{}. {}'.format(i+1, j)) # printing out the same +``` + + 1. Walmart + 2. Exxon Mobil + 3. Berkshire Hathaway + 4. Apple + 5. UnitedHealth Group + 6. McKesson + 7. CVS Health + 8. Amazon.com + 9. AT&T + 10. General Motors + 11. Ford Motor + 12. AmerisourceBergen + 13. Chevron + 14. Cardinal Health + 15. Costco + 16. Verizon + 17. Kroger + 18. General Electric + 19. Walgreens Boots Alliance + 20. JPMorgan Chase + + +### Getting article names from wiki + +Right off the bat, as you might have guessed, one issue with matching the top 20 Fortune 500 companies to their wikipedia article names is that both of them would not be exactly the same i.e. they match character for character. There will be slight variation in their names. + +To overcome this problem and ensure that we have all the company names and its corresponding wikipedia article, we will use the `wikipedia` package to get suggestions for the company names and their equivalent in wikipedia. + + +```python +wiki_search = [{company : wikipedia.search(company)} for company in companies] +``` + +Inspecting the same, + + +```python +for idx, company in enumerate(wiki_search): + for i, j in company.items(): + print('{}. {} :\n{}'.format(idx+1, i ,', '.join(j))) + print('\n') +``` + + 1. Walmart : + Walmart, History of Walmart, Criticism of Walmart, Walmarting, People of Walmart, Walmart (disambiguation), Walmart Canada, List of Walmart brands, Walmart Watch, 2019 El Paso shooting + + + 2. Exxon Mobil : + ExxonMobil, Exxon, Mobil, Esso, ExxonMobil climate change controversy, Exxon Valdez oil spill, ExxonMobil Building, ExxonMobil Electrofrac, List of public corporations by market capitalization, Exxon Valdez + + + 3. Berkshire Hathaway : + Berkshire Hathaway, Berkshire Hathaway Energy, List of assets owned by Berkshire Hathaway, Berkshire Hathaway Assurance, Berkshire Hathaway GUARD Insurance Companies, Warren Buffett, List of Berkshire Hathaway publications, The World's Billionaires, List of public corporations by market capitalization, David L. Sokol + + + 4. Apple : + Apple, Apple Inc., IPhone, Apple (disambiguation), IPad, Apple Silicon, IOS, MacOS, Macintosh, Fiona Apple + + + 5. UnitedHealth Group : + UnitedHealth Group, Optum, Pharmacy benefit management, William W. McGuire, Stephen J. Hemsley, Golden Rule Insurance Company, Catamaran Corporation, PacifiCare Health Systems, Gail Koziara Boudreaux, Amelia Warren Tyagi + + + 6. McKesson : + McKesson Corporation, DeRay Mckesson, McKesson Europe, Malcolm McKesson, Rexall (Canada), McKesson Plaza, McKesson (disambiguation), Johnetta Elzie, McKesson & Robbins scandal (1938), John Hammergren + + + 7. CVS Health : + CVS Health, CVS Pharmacy, CVS Health Charity Classic, CVS Caremark, Pharmacy benefit management, Larry Merlo, CVS, Encompass Health, Longs Drugs, MinuteClinic + + + 8. Amazon.com : + Amazon (company), History of Amazon, List of Amazon products and services, Prime Video, List of Amazon original programming, Amazon Web Services, Dot-com bubble, List of mergers and acquisitions by Amazon, Amazon S3, .amazon + + + 9. AT&T : + AT&T, AT&T Mobility, AT&T Corporation, AT&T TV, AT&T Stadium, T & T Supermarket, T, AT&T Communications, AT&T U-verse, AT&T SportsNet + + + 10. General Motors : + General Motors, History of General Motors, General Motors EV1, General Motors Vortec engine, Vauxhall Motors, GMC (automobile), General Motors 122 engine, General Motors 60° V6 engine, General Motors Chapter 11 reorganization, List of General Motors factories + + + 11. 
Ford Motor : + Ford Motor Company, History of Ford Motor Company, Lincoln Motor Company, Ford Trimotor, Henry Ford, Henry Ford II, Ford Foundation, Ford F-Series, Edsel Ford, Ford Germany + + + 12. AmerisourceBergen : + AmerisourceBergen, List of largest companies by revenue, Cardinal Health, Steven H. Collis, Ornella Barra, Good Neighbor Pharmacy, Family Pharmacy, PharMerica, Remdesivir, Michael DiCandilo + + + 13. Chevron : + Chevron Corporation, Chevron, Chevron (insignia), Philip Chevron, Chevron Cars Ltd, Chevron Cars, Chevron bead, Wound Chevron, Chevron (anatomy), Chevron Phillips Chemical + + + 14. Cardinal Health : + Cardinal Health, Cardinal, Catalent, Cardinal (TV series), Robert D. Walter, Dublin, Ohio, Northern cardinal, List of largest companies by revenue, Cordis (medical), George S. Barrett + + + 15. Costco : + Costco, W. Craig Jelinek, American Express, Price Club, James Sinegal, Rotisserie chicken, Jeffrey Brotman, Warehouse club, Richard Chang (Costco), Costco bear + + + 16. Verizon : + Verizon Communications, Verizon Wireless, Verizon Media, Verizon Fios, Verizon Building, Verizon Delaware, Verizon Business, 4G, Verizon Hub, Verizon Hum + + + 17. Kroger : + Kroger, Murder Kroger, Kroger (disambiguation), Chad Kroeger, Bernard Kroger, Michael Kroger, Stanley Kamel, Tonio Kröger, Rodney McMullen, List of Monk characters + + + 18. General Electric : + General Electric, General Electric GEnx, General Electric CF6, General Electric F110, General Electric F404, General Electric GE9X, General Electric GE90, General Electric J85, General Electric F414, General Electric Company + + + 19. Walgreens Boots Alliance : + Walgreens Boots Alliance, Alliance Boots, Walgreens, Boots (company), Alliance Healthcare, Stefano Pessina, Boots Opticians, Rite Aid, Ken Murphy (businessman), Gregory Wasson + + + 20. JPMorgan Chase : + JPMorgan Chase, Chase Bank, 2012 JPMorgan Chase trading loss, JPMorgan Chase Tower (Houston), 270 Park Avenue, Chase Paymentech, 2014 JPMorgan Chase data breach, Bear Stearns, Jamie Dimon, JPMorgan Chase Building (Houston) + + + + +Now let's get the most probable ones (the first suggestion) for each of the first 20 companies on the Fortune 500 list, + + +```python +most_probable = [(company, wiki_search[i][company][0]) for i, company in enumerate(companies)] +companies = [x[1] for x in most_probable] + +print(most_probable) +``` + + [('Walmart', 'Walmart'), ('Exxon Mobil', 'ExxonMobil'), ('Berkshire Hathaway', 'Berkshire Hathaway'), ('Apple', 'Apple'), ('UnitedHealth Group', 'UnitedHealth Group'), ('McKesson', 'McKesson Corporation'), ('CVS Health', 'CVS Health'), ('Amazon.com', 'Amazon (company)'), ('AT&T', 'AT&T'), ('General Motors', 'General Motors'), ('Ford Motor', 'Ford Motor Company'), ('AmerisourceBergen', 'AmerisourceBergen'), ('Chevron', 'Chevron Corporation'), ('Cardinal Health', 'Cardinal Health'), ('Costco', 'Costco'), ('Verizon', 'Verizon Communications'), ('Kroger', 'Kroger'), ('General Electric', 'General Electric'), ('Walgreens Boots Alliance', 'Walgreens Boots Alliance'), ('JPMorgan Chase', 'JPMorgan Chase')] + + +We can notice that most of the wiki article titles make sense. However, **Apple** is quite ambiguous in this regard as it can indicate the fruit as well as the company. However we can see that the second suggestion returned by was **Apple Inc.**. Hence, we can manually replace it with **Apple Inc.** as follows, + + +```python +companies[companies.index('Apple')] = 'Apple Inc.' 
# replacing "Apple" +print(companies) # final list of wikipedia article titles +``` + + ['Walmart', 'ExxonMobil', 'Berkshire Hathaway', 'Apple Inc.', 'UnitedHealth Group', 'McKesson Corporation', 'CVS Health', 'Amazon (company)', 'AT&T', 'General Motors', 'Ford Motor Company', 'AmerisourceBergen', 'Chevron Corporation', 'Cardinal Health', 'Costco', 'Verizon Communications', 'Kroger', 'General Electric', 'Walgreens Boots Alliance', 'JPMorgan Chase'] + + +### Retrieving the infoboxes + +Now that we have mapped the names of the companies to their corresponding wikipedia article let's retrieve the infobox data from those pages. + +`wptools` provides easy to use methods to directly call the MediaWiki API on our behalf and get us all the wikipedia data. Let's try retrieving data for **Walmart** as follows, + + +```python +page = wptools.page('Walmart') +page.get_parse() # parses the wikipedia article +``` + + en.wikipedia.org (parse) Walmart + en.wikipedia.org (imageinfo) File:Walmart store exterior 5266815680.jpg + Walmart (en) data + { + image: {'kind': 'parse-image', 'file': 'File:Walmart s... + infobox: name, logo, logo_caption, image, image_size,... + iwlinks: https://commons.wikimedia.org/wiki/Category:W... + pageid: 33589 + parsetree: