DarkMatter in Cyberspace
  • Home
  • Categories
  • Tags
  • Archives

Scrapy Notes


Install on Ubuntu/Mint:

sudo apt-get install python-twisted python-libxml2 python-simplejson
sudo pip install scrapy

Create an alias in ~/.bash_aliases for convenience: alias sa='scrapy'.

Crawl website in shell:

$ scrapy shell http://myexpo.com/exhibition/32191.html
...
[1] name = response.xpath('//h1/text()').extract()[0]
[2] time = response.xpath('//div[@class="location"]/span[1]/text()').extract()[0].split('\r\n')[1].strip()
[3] addr = response.xpath('//div[@class="location"]/span[2]/text()').extract()[0].split('\r\n')[1].strip()

Crawl my blog

$ sa startproject myblog

Ref:

scrapy研究探索(二)——爬w3school.com.cn

Unicode character in Python 2.x

You have to add u'' prefix to a unicode string in Python 2.x. Their raw value is in '\u' format. To see them in human friendly format, use print() function.

In [85]: ex = u'中国2015年3月'

In [86]: ex
Out[86]: u'\u4e2d\u56fd2015\u5e743\u6708'

In [87]: print(ex)
中国2015年3月

In [88]: ex.find(u'年')
Out[88]: 6

See more detailed contents about this topic in note "Unicode and File I/O in Python 2.X and 3.X".

XPath Grammar

Ref: XPath 语法



Published

Mar 30, 2015

Last Updated

Mar 30, 2015

Category

Tech

Tags

  • scrapy 1
  • unicode 10
  • web crawler 1
  • xpath 2

Contact

  • Powered by Pelican. Theme: Elegant by Talha Mansoor