show | version | enable_checker |
---|---|---|
step | 1.0 | true |
- Last time we really crawled a website
- oeasy.org
- Right-clicked to inspect an element
- Got its xpath
- After crawling, extracted the value of the href attribute
- Then sliced and joined it into an absolute link address
- And stored the results in urls.txt
- Can we now crawl every one of those links? 🤔
- Save a copy as o2z.py
- Then open it and edit
```python
import requests
from lxml import etree

# read back the links collected last time
with open("urls.txt") as f:
    urls = f.readlines()
for url in urls:
    response = requests.get(url)
    b_html = response.content
    et_html = etree.HTML(b_html)
    # "." selects the root node of each parsed page
    et_targets = et_html.xpath(".")
    for et_target in et_targets:
        print(et_target.tag)
```
- Running it prints the html node once for each of the 12 pages
- Right-click the 电路基础 link under the 电路 section
- Copy its xpath
- Modify the code
- To follow that address
- And print the matched items (a sketch of this step follows below)
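- A minimal sketch of this intermediate version, assuming the copied path matches the /html/body/div/div/div[1] used later in this lesson; the exact xpath on the live site may differ

```python
import requests
from lxml import etree

with open("urls.txt") as f:
    urls = f.readlines()
for url in urls:
    # note: each url still ends with a newline at this point,
    # which turns out to matter (see the 404 hunt below)
    response = requests.get(url)
    et_html = etree.HTML(response.content)
    # the path copied from the browser for the course-name element
    et_targets = et_html.xpath("/html/body/div/div/div[1]")
    for et_target in et_targets:
        print(et_target.text)
```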
- There is a small problem
- The love page
- Is laid out differently from the others
- So it should be excluded
- By slicing the list
- The printed addresses seem to open fine in Firefox
- But everything the crawler fetches is a 404
- Bracket each url with dashes at head and tail
- To make the problem visible (see the diagnostic sketch below)
- It turns out each address has an extra `\n` at the end
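- A quick diagnostic sketch, assuming urls.txt is the file saved last time; a trailing newline pushes the closing dash onto the next output line

```python
with open("urls.txt") as f:
    for url in f.readlines():
        # if url ends with "\n", the closing dash
        # shows up at the start of the next line
        print("-" + url + "-")
```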
- Replace that newline marker
- Besides the course name
- We also want to save each tutorial's address
- Both can be printed separately
- Open each link, then store
- Each tutorial's name
- And each tutorial's address
```python
import requests
from lxml import etree

# skip the first line: the love page is laid out differently
with open("urls.txt") as f:
    urls = f.readlines()[1:]
for url in urls:
    # strip the trailing newline that caused the 404s
    url = url.replace("\n", "")
    response = requests.get(url)
    b_html = response.content
    et_html = etree.HTML(b_html)
    # course names, and the first link belonging to each
    et_targets = et_html.xpath("/html/body/div/div/div[1]")
    et_links = et_html.xpath("/html/body/div/div/div[2]/a[1]")
    for num in range(len(et_targets)):
        print(et_targets[num].text)
        print(et_links[num].attrib["href"])
```
```python
import requests
from lxml import etree

# split on newlines instead of readlines(), then slice away
# the love page and the trailing empty string
with open("urls.txt") as f:
    urls = f.read().split("\n")[1:-1]
for url in urls:
    response = requests.get(url)
    s_html = response.content.decode("utf-8")
    et_html = etree.HTML(s_html)
    et_target = et_html.xpath("/html/body/div/div/div[1]")
    et_target2 = et_html.xpath("/html/body/div/div/div[2]/a[1]")
    print("url----" + url)
    count = 0
    while count < len(et_target):
        print(et_target[count].text)
        print(et_target2[count].text)
        print(et_target2[count].attrib["href"])
        count = count + 1
    print("count====" + str(count))
```
- Slice the list read from urls.txt with [1:-1]
- The leading 1 drops element 0
- The trailing -1 drops the last element (see the miniature below)
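- A miniature of that slice, assuming urls.txt starts with the love page and ends with a newline, so split leaves an empty string at the end

```python
# hypothetical miniature of urls.txt's contents
text = "love-url\nurl-a\nurl-b\n"
lines = text.split("\n")
print(lines)        # ['love-url', 'url-a', 'url-b', '']
print(lines[1:-1])  # ['url-a', 'url-b']
```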
- Now every address can be printed to the screen
- Next, redirect that output into a text file
```python
import requests
from lxml import etree

with open("urls.txt") as f:
    urls = f.read().split("\n")[1:-1]
for url in urls:
    response = requests.get(url)
    s_html = response.content.decode("utf-8")
    et_html = etree.HTML(s_html)
    et_target = et_html.xpath("/html/body/div/div/div[1]")
    et_target2 = et_html.xpath("/html/body/div/div/div[2]/a[1]")
    count = len(et_target)
    # append each page's name,url pairs to urls.csv
    with open("urls.csv", "at") as f:
        for cur in range(0, count):
            print("cur====" + str(cur))
            print(et_target[cur].text)
            print(et_target2[cur].text)
            print(et_target2[cur].attrib["href"])
            s_url = et_target[cur].text + "," \
                + et_target2[cur].attrib["href"] + "\n"
            f.write(s_url)
```
- We first crawled the home page
- Got a pile of links out of it
- Stored them in urls.txt
- Then walked through every address in urls
- Got each course's name and link
- And loaded them into a database
- Now let's put the whole flow together
- Into a single py file (a sketch follows below)
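- A minimal sketch of that combined flow; the base address, the //a[@href] home-page xpath, and urljoin are stand-ins for the steps from the previous lesson, so adapt them to the real pages

```python
import requests
from lxml import etree
from urllib.parse import urljoin

BASE = "http://oeasy.org"  # assumed base address

# stage 1: crawl the home page and collect course-page links
response = requests.get(BASE)
et_html = etree.HTML(response.content.decode("utf-8"))
# placeholder xpath; urljoin stands in for the manual
# slice-and-join step from the previous lesson
urls = [urljoin(BASE, a.attrib["href"])
        for a in et_html.xpath("//a[@href]")]

# stage 2: visit every link and write name,url pairs to urls.csv
with open("urls.csv", "wt") as f:
    for url in urls:
        response = requests.get(url)
        et_html = etree.HTML(response.content.decode("utf-8"))
        names = et_html.xpath("/html/body/div/div/div[1]")
        links = et_html.xpath("/html/body/div/div/div[2]/a[1]")
        for name, link in zip(names, links):
            # (name.text or "") guards pages where the div has no text
            f.write((name.text or "") + "," + link.attrib["href"] + "\n")
```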
- Update the system and install postgres
- Start postgres
- And log in as the postgres user
```bash
sudo apt update
sudo apt install postgresql
sudo /etc/init.d/postgresql start
sudo -u postgres psql
```
- List all databases

```sql
\l
```
- Create a database named oeasy
- And connect to it

```sql
create database oeasy;
\l
\c oeasy
\dt
```
- Create the table

```sql
create table urls(topic varchar, url varchar);
\dt
```
- Import the data

```sql
\copy urls from '/home/shiyanlou/urls.csv' delimiter ',' csv ;
```
- Query the data

```sql
select * from urls where topic like '%e%';
select * from urls where length(topic) > 5;
```
- This time we really crawled a website
- oeasy.org
- Right-clicked to inspect elements
- Got their xpaths
- Extracted the href attribute values after crawling
- Sliced and joined them into absolute link addresses
- And crawled every one of those links
- Can we go crawl Baidu next? 🤔
- We'll talk about that next time