This article shows how to use Nokogiri to parse HTML and extract content from it.
Nokogiri
Nokogiri is a gem that parses HTML and XML quickly.
gem install nokogiri
require 'nokogiri'
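There are several ways to open a page:
require 'nokogiri'
require 'open-uri'
require 'restclient'
# From a local file
page = Nokogiri::HTML(File.open("index.html"))
# From a URL via open-uri (URI.open is needed on Ruby 3.0+, where a bare open no longer accepts URLs)
page = Nokogiri::HTML(URI.open("http://en.wikipedia.org/"))
# From a string
page = Nokogiri::HTML("<html>xxxxxxxxx</html>")
# From a URL via RestClient
page = Nokogiri::HTML(RestClient.get("http://en.wikipedia.org/"))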
Nokogiri and CSS selectors
Some common usage:
page = Nokogiri::HTML(URI.open(PAGE_URL))
puts page.css("title")[0].name # => title
puts page.css("title")[0].text # => My webpage
puts page.css("title").text # => My webpage
Given a page that contains links such as:
<a href="http://www.google.com">Click here</a>
# set URL to point to where the page exists
page = Nokogiri::HTML(URI.open(PAGE_URL))
links = page.css("a")
puts links.length # => 6
puts links[0].text # => Click here
puts links[0]["href"] # => http://www.google.com
# Filter in Ruby with select; the result is a plain Array
page = Nokogiri::HTML(URI.open(PAGE_URL))
news_links = page.css("a").select{|link| link['data-category'] == "news"}
news_links.each{|link| puts link['href'] }
#=> http://reddit.com
#=> http://www.nytimes.com
puts news_links.class #=> Array
# Filter with an attribute selector; the result stays a NodeSet
news_links = page.css("a[data-category=news]")
news_links.each{|link| puts link['href']}
#=> http://reddit.com
#=> http://www.nytimes.com
puts news_links.class #=> Nokogiri::XML::NodeSet
# Because css returns a NodeSet, calls can be chained
page.css('p').css("a[data-category=news]").css("strong")
#the_id_name_here .the_classname_here
. selects elements by class
# selects elements by id
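The final result might look like this:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open("http://www.liubaicai.net/archives/573.html"))
# Text of the post title
title = doc.css('h1.header-post-title-class').first.content
# The node that holds the post body
content = doc.css('div#content').first
puts content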