江鸟's Blog

A quick introduction to Python web scraping

Word count: 3k · Reading time: 15 min
2019/06/11

A quick introduction to Python web scraping; just enough to get your foot in the door.

A quick introduction to Python web scraping

What the basic statements mean

  1. Import the module

    import requests
  2. Fetch a page

    r = requests.get('https://api.github.com/events')
    # GitHub's public event timeline

A demonstration of the overall functionality

import requests

response = requests.get("https://www.baidu.com")
print(type(response))
print(response.status_code)
print(type(response.text))
print(response.text)
print(response.cookies)
print(response.content)
print(response.content.decode("utf-8"))

On many sites, reading response.text directly produces garbled output because requests guessed the wrong encoding. In that case use response.content, which returns the raw bytes, and decode them yourself with decode("utf-8"); that fixes the garbled text you would otherwise get from response.text.
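
If you would rather not decode by hand, a minimal sketch of an alternative is to let requests re-guess the encoding before reading response.text (nothing here beyond the standard requests API):

import requests

response = requests.get("https://www.baidu.com")
# apparent_encoding guesses the charset from the body itself;
# assigning it to response.encoding makes response.text decode correctly
response.encoding = response.apparent_encoding
print(response.text)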

Requests

Basic GET request

import requests

response = requests.get('http://httpbin.org/get')
print(response.text)

GET with parameters

import requests

response = requests.get("http://httpbin.org/get?name=zhaofan&age=23")
print(response.text)

Passing parameters with the params keyword

import requests
data = {
    "name": "zhaofan",
    "age": 22
}
response = requests.get("http://httpbin.org/get",params=data)
print(response.url)
print(response.text)

Parsing JSON

import requests
import json

response = requests.get("http://httpbin.org/get")
print(type(response.text))
print(response.json())
print(json.loads(response.text))
print(type(response.json()))

The json() method built into requests simply runs json.loads() on the response body, so the two calls return exactly the same result.
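
As a small illustrative sketch, the parsed result is an ordinary dict you can index into; the field names below are simply what httpbin.org/get happens to return:

import requests

response = requests.get("http://httpbin.org/get")
data = response.json()          # same dict as json.loads(response.text)
print(data["url"])              # the requested URL echoed back by httpbin
print(data["headers"]["Host"])  # nested fields are plain dict lookups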

Fetching binary data

Use response.content to get the body as raw bytes; the same approach works for downloading images and video files.

import requests

response = requests.get("https://github.com/favicon.ico")
with open('favicon.ico', 'wb') as f:  # 'wb' writes the raw bytes to a new file
    f.write(response.content)

Adding headers

Take Zhihu as an example; the following script fails:

import requests
response =requests.get("https://www.zhihu.com")
print(response.text)

That is because Zhihu requires request headers. Open chrome://version in Chrome to see your User-Agent string, then add it to the headers:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}
response = requests.get("https://www.zhihu.com", headers=headers)

print(response.text)

POST requests

import requests

data = {
    "name": "zhaofan",
    "age": 23
}
response = requests.post("http://httpbin.org/post",data=data)
print(response.text)

Compared with the GET example, the only change is that requests.post() is used and the data is sent in the request body instead of the query string.
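
If you want to send a JSON body rather than form data, requests also accepts a json parameter. A short sketch (httpbin echoes the parsed body back under the "json" key):

import requests

data = {"name": "zhaofan", "age": 23}
# json= serializes the dict and sets Content-Type: application/json
response = requests.post("http://httpbin.org/post", json=data)
print(response.json()["json"])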

Responses

The response object exposes many useful attributes, for example:

import requests

response = requests.get("http://www.baidu.com")
print(type(response.status_code),response.status_code)
print(type(response.headers),response.headers)
print(type(response.cookies),response.cookies)
print(type(response.url),response.url)
print(type(response.history),response.history)
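
A small related sketch: status_code can be compared against requests.codes, and raise_for_status() turns 4xx/5xx responses into exceptions, which is often handier than checking by hand:

import requests

response = requests.get("http://www.baidu.com")
if response.status_code == requests.codes.ok:
    print("request succeeded")
# raises requests.exceptions.HTTPError if the status is 4xx or 5xx
response.raise_for_status()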

Advanced usage of requests

File upload

This works much like the other parameters: build a dict and pass it via the files parameter.

import requests
files = {"files": open("git.jpeg", "rb")}
response = requests.post("http://httpbin.org/post",files=files)
print(response.text)

Getting cookies

import requests

response = requests.get("http://www.baidu.com")
print(response.cookies)

for key, value in response.cookies.items():
    print(key + "=" + value)

Session persistence

One use of cookies is to simulate a logged-in state and keep a session alive across requests.

import requests
s = requests.Session()
s.get("http://httpbin.org/cookies/set/number/123456")
response = s.get("http://httpbin.org/cookies")
print(response.text)
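
For contrast, a sketch of the same two calls without a Session: each plain requests.get shares no cookies with the previous one, so the cookie set by the first call never reaches the second.

import requests

requests.get("http://httpbin.org/cookies/set/number/123456")
response = requests.get("http://httpbin.org/cookies")
print(response.text)  # prints an empty cookie jar, e.g. {"cookies": {}}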

Certificate verification

Many sites are now served over HTTPS, which brings certificate verification into play.

import requests
from requests.packages import urllib3
urllib3.disable_warnings()  # silence the InsecureRequestWarning triggered by verify=False
response = requests.get("https://www.12306.cn", verify=False)  # skip certificate verification
print(response.status_code)
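
verify=False simply turns certificate checking off. If you do have the site's CA certificate, a safer sketch is to point verify at it (the file path below is a hypothetical placeholder):

import requests

# "/path/to/ca-bundle.crt" stands in for a locally saved CA bundle
response = requests.get("https://www.12306.cn", verify="/path/to/ca-bundle.crt")
print(response.status_code)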

Proxy settings

import requests

proxies = {
    "http": "http://127.0.0.1:9999",
    "https": "http://127.0.0.1:8888"
}
response = requests.get("https://www.baidu.com",proxies=proxies)
print(response.text)

If the proxy requires a user name and password, just change the dict to:
proxies = { "http": "http://user:password@127.0.0.1:9999" }
If your proxy uses SOCKS, install the extra dependency with pip install "requests[socks]" and configure it as:
proxies = { "http": "socks5://127.0.0.1:9999", "https": "socks5://127.0.0.1:8888" }

Timeout settings

The timeout parameter sets how long requests will wait for the server before giving up.
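
A minimal sketch: wait at most one second; if the server takes longer, requests raises a Timeout exception.

import requests

# timeout=1 means: raise requests.exceptions.Timeout if the server
# does not respond within one second
response = requests.get("http://httpbin.org/get", timeout=1)
print(response.status_code)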

Authentication

For sites that require HTTP authentication, use the requests.auth module:

import requests

from requests.auth import HTTPBasicAuth

response = requests.get("http://120.27.34.24:9001/",auth=HTTPBasicAuth("user","123"))
print(response.status_code)

Or, more simply:

import requests

response = requests.get("http://120.27.34.24:9001/",auth=("user","123"))
print(response.status_code)

Exception handling

The exceptions raised by requests are documented in detail here:

http://www.python-requests.org/en/master/api/#exceptions

All of them live in requests.exceptions.

A simple demonstration

import requests

from requests.exceptions import ReadTimeout, ConnectionError, RequestException


try:
    response = requests.get("http://httpbin.org/get", timeout=0.1)
    print(response.status_code)
except ReadTimeout:
    print("timeout")
except ConnectionError:
    print("connection error")
except RequestException:
    print("error")

Using Beautiful Soup

Parsers

Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python, moderate speed, lenient parsing | Poor leniency in versions before Python 2.7.3 and 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast, lenient parsing | Requires the C library
lxml XML parser | BeautifulSoup(markup, "xml") | Fast, the only parser that supports XML | Requires the C library
html5lib | BeautifulSoup(markup, "html5lib") | Most lenient, parses documents the way a browser does, produces valid HTML5 | Very slow, requires an extra Python package

The lxml parser handles both HTML and XML, is fast, and is lenient, so it is the recommended choice.
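
A minimal fallback sketch: if lxml is not installed, the built-in html.parser still works with no extra dependency, just more slowly.

from bs4 import BeautifulSoup

# html.parser ships with Python, so this runs even without lxml installed
soup = BeautifulSoup('<p>Hello</p>', 'html.parser')
print(soup.p.string)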

Initializing Beautiful Soup

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'lxml')
print(soup.p.string)

Basic usage

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # parse with the lxml parser and assign the result to soup
print(soup.prettify())  # prettify() prints the parsed document with standard indentation, fixing up the markup automatically
print(soup.title.string)  # print the text content of the title node

Selecting elements in detail

print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)

Output

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p> #当有多个节点时,这种选择方式只会选择到第一个匹配的节点
  1. Getting the name

    print(soup.title.name)

    title
  2. Getting attributes

    A node may have several attributes, such as id and class. After selecting the node, call attrs to get all of them:

    print(soup.p.attrs)
    print(soup.p.attrs['name'])

    {'class': ['title'], 'name': 'dromouse'}
    dromouse

    Shorthand:

    print(soup.p['name'])
    print(soup.p['class'])
  3. Associated selection (navigating the tree)

    Return the direct children of p:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.children)
    for i, child in enumerate(soup.p.children):
        print(i, child)

    Return all descendants of p:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.descendants)
    for i, child in enumerate(soup.p.descendants):
        print(i, child)

    The parent node:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.a.parent)

    All ancestor nodes:

    soup = BeautifulSoup(html, 'lxml')
    #print(type(soup.a.parents))
    print(list(enumerate(soup.a.parents)))
  4. Sibling nodes

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print('Next Sibling', soup.a.next_sibling)
    print('Prev Sibling', soup.a.previous_sibling)
    print('Next Siblings', list(enumerate(soup.a.next_siblings)))
    print('Prev Siblings', list(enumerate(soup.a.previous_siblings)))

    next_sibling and previous_sibling return the next and the previous sibling element of a node, while next_siblings and previous_siblings return generators over all following and all preceding siblings, respectively.

  5. Extracting information

    html = """
    <html>
    <body>
    <p>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Bob</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    </p>
    <p>
    """

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print('Next Sibling:')
    print(type(soup.a.next_sibling))
    print(soup.a.next_sibling)
    print(soup.a.next_sibling.string)
    print('Parent:')
    print(type(soup.a.parents))
    print(list(soup.a.parents)[0])
    print(list(soup.a.parents)[0].attrs['class'])

    Result

    Next Sibling:
    <class 'bs4.element.Tag'>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    Lacie
    Parent:
    <class 'generator'>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Bob</a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    </p>
    ['story']
  6. Method selectors

    find_all()

    find_all(name, attrs, recursive, text, **kwargs)

    The examples below assume an html string containing ul/li lists (different from the one defined above).

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(name='ul'))
    print(type(soup.find_all(name='ul')[0]))

    Nested queries:

    for ul in soup.find_all(name='ul'):
        print(ul.find_all(name='li'))
        for li in ul.find_all(name='li'):
            print(li.string)

    attrs

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(attrs={'id': 'list-1'}))
    print(soup.find_all(attrs={'name': 'elements'}))

    text

    The text parameter matches against a node's text; you can pass either a string or a compiled regular expression object.

    import re
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(text=re.compile('link')))

    find_parents() and find_parent(): the former returns all ancestor nodes, the latter the direct parent.
    find_next_siblings() and find_next_sibling(): the former returns all following siblings, the latter the first following sibling.
    find_previous_siblings() and find_previous_sibling(): the former returns all preceding siblings, the latter the first preceding sibling.
    find_all_next() and find_next(): the former returns all matching nodes after the current node, the latter the first matching node after it.
    find_all_previous() and find_previous(): the former returns all matching nodes before the current node, the latter the first matching node before it. (A short sketch using these helpers appears after the CSS selector item below.)
  7. CSS selectors

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.select('.panel .panel-heading'))
    print(soup.select('ul li'))
    print(soup.select('#list-2 .element'))
    print(type(soup.select('ul')[0]))
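
As promised above, a short sketch of two of the other find_* helpers (find_parent and find_next_sibling), run against a small html snippet in the same style as the earlier examples:

from bs4 import BeautifulSoup

html = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, 'lxml')
first_a = soup.find('a')
print(first_a.find_parent('p')['class'])     # ['story']
print(first_a.find_next_sibling('a')['id'])  # link2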

selenium

Declaring a browser object

from selenium import webdriver

browser = webdriver.Chrome()
browser = webdriver.Firefox()

Visiting a page

from selenium import webdriver

browser = webdriver.Chrome()

browser.get("http://www.baidu.com")
print(browser.page_source)
browser.close()

Finding a single element

from selenium import webdriver

browser = webdriver.Chrome()

browser.get("http://www.taobao.com")
input_first = browser.find_element_by_id("q")
input_second = browser.find_element_by_css_selector("#q")
input_third = browser.find_element_by_xpath('//*[@id="q"]')
print(input_first)
print(input_second)
print(input_third)
browser.close()


Here are the commonly used element-finding methods:

find_element_by_name
find_element_by_id
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector

A more general form:

from selenium import webdriver

from selenium.webdriver.common.by import By

browser = webdriver.Chrome()

browser.get("http://www.taobao.com")
input_first = browser.find_element(By.ID,"q")
print(input_first)
browser.close()

Finding multiple elements

from selenium import webdriver


browser = webdriver.Chrome()
browser.get("http://www.taobao.com")
lis = browser.find_elements_by_css_selector('.service-bd li')
print(lis)
browser.close()
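
One caveat worth adding: the find_element_by_* / find_elements_by_* helpers used above have been removed in newer Selenium releases, so on current versions the By-based form is the one that works. A sketch of the same lookup in that style:

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get("http://www.taobao.com")
# equivalent to find_elements_by_css_selector('.service-bd li') on old versions
lis = browser.find_elements(By.CSS_SELECTOR, '.service-bd li')
print(lis)
browser.close()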