Install on Ubuntu/Mint:
sudo apt-get install python-twisted python-libxml2 python-simplejson
sudo pip install scrapy
Create an alias in ~/.bash_aliases for convenience: alias sa='scrapy'
.
Crawl website in shell:
$ scrapy shell http://myexpo.com/exhibition/32191.html
...
[1] name = response.xpath('//h1/text()').extract()[0]
[2] time = response.xpath('//div[@class="location"]/span[1]/text()').extract()[0].split('\r\n')[1].strip()
[3] addr = response.xpath('//div[@class="location"]/span[2]/text()').extract()[0].split('\r\n')[1].strip()
Crawl my blog
$ sa startproject myblog
Ref:
scrapy研究探索(二)——爬w3school.com.cn
Unicode character in Python 2.x
You have to add u'' prefix to a unicode string in Python 2.x.
Their raw value is in '\u' format.
To see them in human friendly format, use print()
function.
In [85]: ex = u'中国2015年3月'
In [86]: ex
Out[86]: u'\u4e2d\u56fd2015\u5e743\u6708'
In [87]: print(ex)
中国2015年3月
In [88]: ex.find(u'年')
Out[88]: 6
See more detailed contents about this topic in note "Unicode and File I/O in Python 2.X and 3.X".
XPath Grammar
Ref: XPath 语法