Parsing and Extracting HTML with Nokogiri in Ruby

Posted by baicai on December 5, 2015

This post shows how to parse HTML with Nokogiri and extract content from it.

Nokogiri

Nokogiri is a gem that parses HTML and XML quickly.

gem install nokogiri
require 'nokogiri'

There are several ways to open a page:

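require 'nokogiri'
require 'open-uri'
require 'restclient'

# A local file
page = Nokogiri::HTML(File.open("index.html"))
# A remote page via open-uri (Kernel#open no longer opens URLs as of Ruby 3.0)
page = Nokogiri::HTML(URI.open("http://en.wikipedia.org/"))
# An HTML string
page = Nokogiri::HTML("<html>xxxxxxxxx</html>")
# A response fetched with rest-client
page = Nokogiri::HTML(RestClient.get("http://en.wikipedia.org/"))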

Nokogiri and CSS selectors


Some common usages:

page = Nokogiri::HTML(URI.open(PAGE_URL))
puts page.css("title")[0].name   # => title
puts page.css("title")[0].text   # => My webpage
puts page.css("title").text   # => My webpage

<a href="http://www.google.com">Click here</a>
# Set PAGE_URL to point to where the page exists
page = Nokogiri::HTML(URI.open(PAGE_URL))
links = page.css("a")
puts links.length   # => 6
puts links[0].text   # => Click here
puts links[0]["href"] # => http://www.google.com

page = Nokogiri::HTML(URI.open(PAGE_URL))
news_links = page.css("a").select{|link| link['data-category'] == "news"}
news_links.each{|link| puts link['href'] }
#=>   http://reddit.com
#=>   http://www.nytimes.com        
puts news_links.class   #=>   Array 

news_links = page.css("a[data-category=news]")
news_links.each{|link| puts link['href']}
#=>   http://reddit.com
#=>   http://www.nytimes.com
puts news_links.class   #=>   Nokogiri::XML::NodeSet   

page.css('p').css("a[data-category=news]").css("strong")

#the_id_name_here .the_classname_here

A leading . matches a class.

A leading # matches an id.

The final result might look like this:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open("http://www.liubaicai.net/archives/573.html"))
title = doc.css('h1.header-post-title-class').first.content
content = doc.css('div#content').first
puts content