Development Techniques - Web Crawler Fundamentals

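Contents

Web Crawler Fundamentals
- What Crawlers Are Used For
- HTTP
  - Questions for Thought
- HTML & XML
  - Questions for Thought
- Encoding & Charset
  - Questions for Thought
- Crawler Features
  - Questions for Thought
- Python
  - Questions for Thought
- Scrapy
  - Questions for Thought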

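Web Crawler Fundamentals

What Crawlers Are Used For

Deepening your understanding of operating systems, low-level internals, networking, and protocols;
Security vulnerability scanning (ZoomEye, DarkNet, Deep Web);
Voting tools and ranking-manipulation tools;
Ticket-grabbing software;
Spam comments;
Search engines;
News from vertical domains and any other information you want to collect in bulk;
Collecting corpora for machine learning;
Cold-start data for apps;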

HTTP

The five-layer TCP/IP model [1];
HTTP 1.0 vs. HTTP 1.1 [2];
HTTP/2;
URI: URL, URN;
- Reserved characters in URLs
- URL format: scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]
Methods: GET, POST;
Headers:
- Host
- User-Agent
- Referer
- Content-Type
- Content-Length
- Content-Encoding & Accept-Encoding - content coding
- Transfer-Encoding & TE - transfer coding
- Content-Range
- Last-Modified & ETag
Proxies, caches, gateways, tunnels, and agents
- Proxy types: Forward Proxy, Reverse Proxy, etc.
- Web proxy: GET https://www.baidu.com/search HTTP/1.0
- Web tunnel: HTTP CONNECT
- Web tunnel vs. web proxy
MIME: application/x-www-form-urlencoded, multipart/form-data;
Cookie & Session;
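
The URL format listed above can be explored with the standard library. The sketch below is a minimal illustration, assuming Python 3; the URL itself is made up purely so that every component of the format is present.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL, chosen only to exercise every component of the format.
url = "http://user:secret@www.example.com:8080/search/index.html?q=spider&page=2#results"

parts = urlsplit(url)

print(parts.scheme)    # http
print(parts.username)  # user
print(parts.password)  # secret
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /search/index.html
print(parts.query)     # q=spider&page=2
print(parts.fragment)  # results

# The query string decodes into a dict of lists, because a key may repeat.
query = parse_qs(parts.query)
print(query)           # {'q': ['spider'], 'page': ['2']}
```

Note that the fragment is handled by the client and is not sent to the server, so a crawler can usually drop it before comparing URLs.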

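Questions for Thought

How do you confirm that a file download has completed?
What exceptions can occur during an HTTP exchange?
When requesting http://www.baidu.com/search through a proxy, is the domain www.baidu.com resolved via DNS by the local machine or by the proxy server?
When requesting https://www.baidu.com/search through a proxy, does the local machine need SSL support?
What are the similarities and differences between a web tunnel and a web proxy?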

[1]	Internet protocol suite

[2]	Key Differences between HTTP/1.0 and HTTP/1.1

HTML & XML

Basic tags;
Declaring Character Encoding [3]
- HTTP Content-Type: text/html; charset=utf-8
- HTML <meta> tag and its attributes:
  - charset
  - http-equiv="Content-Type" content="text/html; charset=utf-8"
- XML: <?xml version="1.0" encoding="utf-8"?>
- CSS: @charset "utf-8"
Page redirects
- JavaScript redirect after User-Agent detection
- <meta http-equiv="refresh" content="5; url=http://www.example.com">
- <body onload="window.location = 'http://example.com/'">
XPath [4]
Regular Expression [5]
HTML DOM [6]
AJAX [7]
Headless Browser - A web browser without a graphical user interface.
- PhantomJS
- Splash
- Google [8]
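
The charset declarations listed above can be checked in order of precedence: the HTTP Content-Type header first, then a <meta> tag in the document. The following is a rough sketch under those assumptions; the function names and the UTF-8 fallback are illustrative choices, and real pages may also need BOM checks or statistical detection, which are not shown here.

```python
import re
from email.message import Message

# Matches both <meta charset="..."> and the http-equiv form, where
# "charset=" appears inside the content attribute.
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([\w-]+)""", re.IGNORECASE
)


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, None if absent


def declared_charset(content_type, body, default="utf-8"):
    """Pick the declared charset: HTTP header first, then <meta>, then default."""
    header_charset = charset_from_content_type(content_type)
    if header_charset:
        return header_charset
    # Only scan the start of the document, where <meta> tags normally live.
    match = META_CHARSET_RE.search(body[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


page = b'<html><head><meta charset="GBK"></head><body></body></html>'
print(declared_charset("text/html", page))                 # gbk
print(declared_charset("text/html; charset=UTF-8", page))  # utf-8
```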

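Questions for Thought

How do you determine the character encoding of a web page?
How do you crawl pages whose content is generated dynamically?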

[3]	Declaring Character Encoding

[4]	XPath on Wikipedia

[5]	regular-expressions.info

[6]	Document Object Model

[7]	Ajax

[8]	Deprecating our AJAX crawling scheme

Encoding & Charset

Base64, URL Encoding (Percent-encoding);
Message digest algorithms and encryption algorithms;
CJK, GBK, GB2312, UTF-8 vs. Python Unicode;
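
A short sketch of how these encodings differ in practice, assuming Python 3 (where str is Unicode text and bytes is encoded data). The sample string is arbitrary.

```python
import base64
import hashlib
from urllib.parse import quote, unquote

text = "爬虫 spider"

# The same text takes a different number of bytes in each charset.
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")
print(len(utf8_bytes), len(gbk_bytes))  # 13 11

# Bytes must be decoded with the charset they were encoded in.
assert gbk_bytes.decode("gbk") == text

# Percent-encoding: non-ASCII and reserved characters become %XX escapes
# of their UTF-8 bytes (quote() uses UTF-8 by default).
escaped = quote(text)
print(escaped)  # %E7%88%AC%E8%99%AB%20spider
assert unquote(escaped) == text

# Base64 is a reversible encoding, not encryption.
token = base64.b64encode(b"spider")
print(token)  # b'c3BpZGVy'
assert base64.b64decode(token) == b"spider"

# A message digest is one-way: fixed-size output, no decode step.
digest = hashlib.sha256(utf8_bytes).hexdigest()
print(len(digest))  # 64
```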

Questions for Thought

How do you determine the character set of HTTP response data?

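Crawler Features

robots.txt [10]
Automatic login and cookie/session maintenance;
Proxy maintenance;
Network timeout handling;
Extracting content from dynamic pages;
Extracting content from AJAX responses;
URL deduplication (Bloom filter, hash map, red-black tree);
Breadth-first and depth-first crawling algorithms;
Concurrency model;
Scalability;
Exception handling;
Flexible configuration and extensibility;
Spider trap [9] detection;
Detecting hidden noise in pages;
How to avoid getting blocked;
How to extract and store data;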
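
To make the deduplication and breadth-first/depth-first items concrete, here is a minimal sketch that walks a made-up in-memory link graph instead of the network. LINKS, crawl and get_links are illustrative names; a real crawler would fetch each page and extract its links where get_links is called.

```python
from collections import deque

# Hypothetical "site": each URL maps to the links found on that page.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/", "/c"],
    "/c": ["/a"],
}


def get_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return LINKS.get(url, [])


def crawl(start, get_links, breadth_first=True):
    """Visit every reachable URL exactly once and return the visit order."""
    seen = {start}            # dedup set: URLs already scheduled
    frontier = deque([start])
    order = []
    while frontier:
        # FIFO gives breadth-first order, LIFO gives depth-first order.
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


print(crawl("/", get_links))                       # ['/', '/a', '/b', '/c']
print(crawl("/", get_links, breadth_first=False))  # ['/', '/b', '/c', '/a']
```

The seen set is the part that grows without bound on a large crawl, which is where a Bloom filter or an external store becomes attractive.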

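Questions for Thought

What network-related exceptions might you need to handle while crawling a single page?
How can you prevent crawlers from scraping certain data?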

[9]	Spider trap

[10]	Robots.txt

Python

Concurrency models and Twisted, gevent, Tornado, asyncio;
HTTP clients: requests, urllib2, httplib2, treq;
Content extraction: re, BeautifulSoup, lxml, html5lib, pyquery, xmltodict;
Crawler frameworks: scrapy, cola, pyspider, portia;
Data storage: redis-py, PyMongo, MySQLdb;
Headless Browser: Splash, PyQt4, PySide, PhantomJS, Selenium;
Deduplication algorithms:
- Bloom Filter - A space-efficient probabilistic data structure used to test whether an element is a member of a set. See the algorithm description for details; there is also a fast, simple, scalable, correct implementation for Python.
- SimHash - A technique for quickly estimating how similar two sets are. The Google crawler uses this algorithm to find near-duplicate pages. See the original paper and the paper from Google. There is also an efficient implementation for Python, and a pure Python implementation.
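
As an illustration of the Bloom filter idea described above, here is a minimal pure-Python sketch. It is not the implementation the text refers to; the class name, the default sizes, and the use of salted BLAKE2b hashes are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # Present only if every one of the item's bits is set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("http://www.example.com/page/1")
print("http://www.example.com/page/1" in seen)  # True
```

Memory use is fixed at size_bits regardless of how many URLs are added; the price is a false-positive rate that rises as the filter fills, so a crawler may occasionally skip a URL it has not actually visited.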

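Questions for Thought

What techniques does Python offer for concurrency, and what are the pros and cons of each?
What criteria should a third-party library meet to be used together with gevent/Tornado?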

Scrapy

Why choose Scrapy:
- Excellent documentation [11]
- Active community [12]
- Low barrier to entry [13]
- Highly extensible [14]
- Strong concurrency
100 Hours
Framework components and workflow [14]

How to implement a spider

Create the project carnie

$ scrapy startproject carnie
$ cd carnie
$ scrapy genspider testspider www.baidu.com

Add the spider code to carnie/spiders/testspider.py

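from scrapy import Spider


class TestSpider(Spider):
    # Name used on the command line: scrapy crawl testspider
    name = "testspider"
    start_urls = ["http://www.baidu.com"]

    def parse(self, response):
        # Called with the downloaded response for each start URL.
        # response.body holds the raw, undecoded bytes of the page.
        print(response.body)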

Run the spider:
```
$ scrapy crawl testspider
```

How to use a proxy

yield Request(url, meta={"proxy": "http://proxyhost:proxyport/"})
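
Setting the proxy on every request by hand gets repetitive, so the same meta key can be filled in from one place. The sketch below is a hypothetical downloader middleware written as a plain class with the classic process_request(request, spider) hook; the class name, the PROXY_LIST setting, and the round-robin policy are assumptions for illustration, not something Scrapy ships.

```python
import itertools


class RotatingProxyMiddleware:
    """Hypothetical downloader middleware: assigns proxies round-robin."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        self._proxies = itertools.cycle(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is an assumed custom setting, not one Scrapy defines.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Respect a proxy that the spider already chose explicitly.
        if "proxy" not in request.meta:
            request.meta["proxy"] = next(self._proxies)
        return None  # None tells Scrapy to keep processing the request
```

To try it, the class would be registered in the project's DOWNLOADER_MIDDLEWARES setting under its module path (for example a hypothetical carnie.middlewares.RotatingProxyMiddleware), with an order value below that of the built-in HttpProxyMiddleware so the proxy is set before that middleware reads it.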

Extensible components:
- Downloader middleware
- Spider middleware
- Item Pipeline
- Extension
How to run on a schedule:
- Use crontab
- Use Scrapyd
- Use Scrapy Cloud
Reference code for extension components
- Scrapy source code
- Scrapy official plugin repository
Related projects
- Scrapely - A library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.
- Portia - A tool that allows you to visually scrape websites without any programming knowledge required. With Portia you can annotate a web page to identify the data you wish to extract, and Portia will understand based on these annotations how to scrape data from similar pages.
- Splash - A JavaScript rendering service with an HTTP API. It's a lightweight browser with an HTTP API, implemented in Python using Twisted and Qt.
- Scrapyrt - An HTTP server which provides an API for scheduling Scrapy spiders and making requests with spiders.
- Crawlera - It allows you to crawl quickly and reliably, managing thousands of proxies internally, so you don't have to.

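Questions for Thought

How do you develop components that take full advantage of Scrapy's concurrency?
How do you stop Scrapy from code while it is running?
How do you turn Scrapy into a distributed architecture? (continuous operation, task scheduling and distribution, result aggregation, status monitoring)
How do you improve development efficiency? (templates, visualization, instance-based learning algorithms)
How do you monitor a spider's running status in real time?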

[11]	Scrapy documentation

[12]	Scrapy community

[13]	Scrapy tutorial

[14]	(1, 2) Architecture overview
