Web Crawler Fundamentals
Uses of crawler technology
- Deepening your understanding of operating systems, low-level internals, networking, and protocols;
- Security vulnerability scanning (ZoomEye, DarkNet, Deep Web);
- Voting tools and ranking-manipulation tools;
- Ticket-grabbing software;
- Spam comments;
- Search engines;
- Vertical-domain news and any other content you want to fetch in bulk;
- Collecting corpora for machine learning;
- Cold-start data for apps;
HTTP
- The five-layer TCP/IP model [1];
- HTTP 1.0 vs. HTTP 1.1 [2];
- HTTP 2.0
- URI: URL, URN;
- URL reserved characters
- URL format: scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]
- Methods: GET, POST;
- Headers:
- Host
- User-Agent
- Referer
- Content-Type
- Content-Length
- Content-Encoding & Accept-Encoding - content encoding
- Transfer-Encoding & TE - transfer encoding
- Content-Range
- Last-Modified & ETag
- Proxy, Cache, Gateway, Tunnel, Agent
- Proxy types: Forward Proxy, Reverse Proxy, etc.
- Web proxy: GET http://www.baidu.com/search HTTP/1.0 (the request line carries the absolute URI)
- Web tunnel: HTTP CONNECT
- Web tunnel vs. web proxy (see the sketch after this list)
- MIME: application/x-www-form-urlencoded, multipart/form-data;
- Cookie & Session;
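The proxy/tunnel distinction above shows up directly on the wire. Below is a minimal sketch using the requests library (the proxy address is a placeholder, not a working server): for an http:// URL the client sends the absolute URI to the proxy in an ordinary GET, while for an https:// URL it first issues CONNECT and then negotiates TLS through the resulting tunnel.

import requests

# Placeholder forward proxy -- replace with a real one.
proxies = {
    "http": "http://proxyhost:8080",
    "https": "http://proxyhost:8080",
}

# Plain HTTP: the client sends "GET http://www.baidu.com/search HTTP/1.1"
# to the proxy, which forwards the request and relays the response back
# (web proxy).
r1 = requests.get("http://www.baidu.com/search", proxies=proxies, timeout=10)

# HTTPS: the client first sends "CONNECT www.baidu.com:443 HTTP/1.1" to the
# proxy, then speaks TLS through the blind tunnel (web tunnel), so the proxy
# sees only the target host and port.
r2 = requests.get("https://www.baidu.com/search", proxies=proxies, timeout=10)

print(r1.status_code, r2.status_code)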
Questions to think about
- How can you confirm that a file has finished downloading?
- What kinds of errors can occur during an HTTP exchange?
- When requesting http://www.baidu.com/search through a proxy, is the domain www.baidu.com resolved by the local machine or by the proxy server?
- When requesting https://www.baidu.com/search through a proxy, does the local machine need SSL support?
- What are the similarities and differences between a web tunnel and a web proxy?
[1] Internet protocol suite
[2] Key Differences between HTTP/1.0 and HTTP/1.1
HTML & XML
- Basic tags;
- Declaring Character Encoding [3] (see the sketch after this list)
- HTTP Content-Type: text/html; charset=utf-8
- HTML <meta> tag and its attributes:
- charset
- http-equiv="Content-Type" content="text/html; charset=utf-8"
- XML: <?xml version="1.0" encoding="utf-8"?>
- CSS: @charset "utf-8"
- Page redirects
- Redirect via JavaScript after User-Agent detection
- <meta http-equiv="refresh" content="5; url=http://www.example.com">
- <body onload="window.location = 'http://example.com/'">
- XPath [4]
- Regular Expression [5]
- HTML DOM [6]
- AJAX [7]
- Headless Browser - A web browser without a graphical user interface.
- PhantomJS
- Splash
- Google [8]
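As a concrete example of the charset-declaration and DOM/XPath items above, here is a minimal sketch using requests and lxml (the URL and XPath expressions are illustrative only):

import requests
from lxml import html

resp = requests.get("http://www.example.com/", timeout=10)

# Build a DOM tree from the raw bytes and query it with XPath.
tree = html.fromstring(resp.content)

# Charset sources, roughly in the order a client checks them:
header_charset = resp.encoding                  # from the Content-Type header
meta_charset = tree.xpath("//meta/@charset")    # HTML5 <meta charset=...>
meta_http_equiv = tree.xpath('//meta[@http-equiv="Content-Type"]/@content')

# Ordinary data extraction with XPath.
title = tree.xpath("//title/text()")
links = tree.xpath("//a/@href")

print(header_charset, meta_charset, meta_http_equiv)
print(title, len(links))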
Encoding & Charset
- Base64, URL Encoding (Percent-encoding);
- Message digest algorithms and encryption algorithms;
- CJK, GBK, GB2312, UTF8 vs. Python Unicode;
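A small Python 3 sketch of the items above: Base64, percent-encoding, and how the same Unicode text maps to different GBK and UTF-8 byte strings (the sample string is arbitrary):

# -*- coding: utf-8 -*-
import base64
from urllib.parse import quote, unquote

text = "爬虫基础"                        # a Python 3 str is Unicode text

# Percent-encoding (URL encoding): the UTF-8 bytes become %XX escapes.
encoded_url = quote(text)                # '%E7%88%AC%E8%99%AB%E5%9F%BA%E7%A1%80'
assert unquote(encoded_url) == text

# Base64 operates on bytes, so the text must be encoded first.
b64 = base64.b64encode(text.encode("utf-8"))
assert base64.b64decode(b64).decode("utf-8") == text

# The same characters have different byte representations in GBK and UTF-8;
# decoding with the wrong codec is the classic source of mojibake.
gbk_bytes = text.encode("gbk")
utf8_bytes = text.encode("utf-8")
print(len(gbk_bytes), len(utf8_bytes))   # 8 vs. 12 bytes for 4 CJK characters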
Questions to think about
- How do you determine the character set of an HTTP response?
Python
- Concurrency models, and Twisted, Gevent, Tornado, asyncio;
- HTTP clients: requests, urllib2, httplib2, treq;
- Content extraction: re, BeautifulSoup, lxml, html5lib, pyquery, xmltodict;
- Crawler frameworks: scrapy, cola, pyspider, portia;
- Data storage: redis-py, PyMongo, MySQLdb;
- Headless browsers: Splash, PyQt4, PySide, PhantomJS, Selenium;
- Deduplication algorithms:
- Bloom Filter - A space-efficient probabilistic data structure used to test whether an element is a member of a set. See the algorithm description for details; there is a fast, simple, scalable, correct implementation for Python.
- SimHash - A technique for quickly estimating how similar two sets are. This algorithm is used by the Google crawler to find near-duplicate pages. See the original paper and the paper from Google. There is also an efficient implementation for Python, and a pure Python implementation.
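To make the Bloom filter idea concrete, here is a self-contained toy sketch (not one of the optimized implementations referenced above); the bit-array size and number of hashes are arbitrary:

import hashlib


class BloomFilter(object):
    """Toy Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, size=2 ** 20, num_hashes=7):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions from one md5 digest (double hashing).
        digest = hashlib.md5(item.encode("utf-8")).hexdigest()
        h1, h2 = int(digest[:16], 16), int(digest[16:], 16)
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter()
seen.add("http://www.baidu.com/")
print("http://www.baidu.com/" in seen)    # True
print("http://www.example.com/" in seen)  # almost certainly False

In a crawler this typically backs request de-duplication: test a URL before scheduling it and add it once it has been fetched.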
Questions to think about
- Which techniques does Python offer for concurrency, and what are the pros and cons of each?
- What criteria should guide the choice of third-party libraries to pair with gevent/Tornado?
Scrapy
100 Hours
Framework components and workflow [14]
- How to implement a spider
Create the project carnie:
$ scrapy startproject carnie
$ cd carnie
$ scrapy genspider testspider www.baidu.com
Add the spider code to carnie/spiders/testspider.py:
from scrapy import Request, Spider


class TestSpider(Spider):
    name = "testspider"
    start_urls = ["http://www.baidu.com"]

    def parse(self, response):
        print(response.body)
Run the spider:
$ scrapy crawl testspider
How to use a proxy
yield Request(url, meta={"proxy": "http://proxyhost:proxyport/"})
- Extensible components:
- Downloader middleware
- Spider middleware
- Item Pipeline
- Extension
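Downloader middleware is the usual hook for request-level concerns such as rotating proxies or user agents. A minimal sketch (the module path, proxy list, and priority number are illustrative):

# carnie/middlewares.py (illustrative location)
import random


class RandomProxyMiddleware(object):
    """Downloader middleware that assigns a random proxy to each request."""

    # Hypothetical proxy pool; in practice load it from settings or a service.
    PROXIES = [
        "http://proxy1:8080",
        "http://proxy2:8080",
    ]

    def process_request(self, request, spider):
        # The built-in HttpProxyMiddleware honours request.meta["proxy"].
        request.meta["proxy"] = random.choice(self.PROXIES)

# Enable it in carnie/settings.py, ordered before the built-in proxy middleware:
# DOWNLOADER_MIDDLEWARES = {
#     "carnie.middlewares.RandomProxyMiddleware": 350,
# }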
- How to run on a schedule:
- Using crontab
- Using Scrapyd (see the sketch after this list)
- Using Scrapy Cloud
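For the Scrapyd route, once the project has been deployed, scheduling a run is a single HTTP call to Scrapyd's API (localhost:6800 is its default address); a minimal sketch:

import requests

# Ask a local Scrapyd instance to schedule one run of the spider.
resp = requests.post("http://localhost:6800/schedule.json",
                     data={"project": "carnie", "spider": "testspider"})
print(resp.json())   # e.g. {"status": "ok", "jobid": "..."}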
Related projects
- Scrapely - A library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, Scrapely constructs a parser for all similar pages.
- Portia - A tool that allows you to visually scrape websites without any programming knowledge required. With Portia you can annotate a web page to identify the data you wish to extract, and Portia will work out from these annotations how to scrape data from similar pages.
- Splash - A JavaScript rendering service with an HTTP API. It's a lightweight browser with an HTTP API, implemented in Python using Twisted and QT.
- Scrapyrt - An HTTP server which provides an API for scheduling Scrapy spiders and making requests with spiders.
- Crawlera - It allows you to crawl quickly and reliably, managing thousands of proxies internally, so you don't have to.
Questions to think about
- How do you develop components that make full use of Scrapy's concurrency?
- How do you stop Scrapy programmatically while it is running?
- How do you turn Scrapy into a distributed architecture? (continuous operation, task scheduling and distribution, result aggregation, status monitoring)
- How do you improve development efficiency? (templates, visualization, instance-based learning algorithms)
- How do you get real-time insight into a crawler's running state?
[11] Scrapy documentation
[12] Scrapy community
[13] Scrapy tutorial
[14] Architecture overview