How is an XPath generator implemented?

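When you write crawlers, the one thing that ultimately cannot be automated is specifying the XPath of the elements to extract. Crawling a specific site almost always comes down to finding the XPath of the next-page link and of the data elements. If XPath generation could be handed to operations colleagues who do not program, it would free up a great deal of programmer time.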

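XPath is, after all, a DSL, and it is fairly hard for people who do not program. Plenty of PMs write SQL fluently, but an operations colleague who can write XPath is hard to find. Every trade has its specialty, and the problems operations staff face are quite different from the ones we programmers face. In my years of experience, teaching them YAML already feels like the limit…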

So could a graphical tool generate the XPath? Obviously yes: Chrome has a built-in XPath generator, as shown in the figure below:

chrome xpath

The XPath generated in this figure is: //*[@id="fc_B_pic"]/ul[1]/li[1]/a[1]

However, Chrome's XPath generation has several drawbacks:

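  1. Chrome only walks upward looking for an element with an id, whereas in practice an ancestor with a class is usually enough to guarantee a correct XPath.
  2. Chrome generates an XPath that pins down a single element as far as possible, so when you want an XPath that selects multiple elements, Chrome cannot help and you still have to rewrite it yourself.
  3. Once generated, the XPath cannot be conveniently verified with a graphical tool.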
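One way a generator could address the first two drawbacks is sketched below. This is a minimal illustration, not how Chrome or any existing tool works: it walks upward from the target until it meets an ancestor with an id or a class, and an optional `multiple` flag drops the positional indexes so the path can match sibling elements too. The node shape (`tag`, `id`, `className`, `parent`, `children`), the helper names, and the class name `pics` are all assumptions made up for this example; plain objects stand in for DOM elements so the sketch runs outside a browser.

```javascript
// Hypothetical node shape standing in for a DOM element.
function makeNode(tag, attrs = {}, children = []) {
  const node = {
    tag,
    id: attrs.id || "",
    className: attrs.className || "",
    parent: null,
    children,
  };
  for (const child of children) child.parent = node;
  return node;
}

// 1-based position of a node among its siblings with the same tag.
function positionAmongSameTag(node) {
  const sameTag = node.parent.children.filter((c) => c.tag === node.tag);
  return sameTag.indexOf(node) + 1;
}

// Walk upward and anchor the path at the first node that has an id or a class.
// With { multiple: true } positional indexes are omitted.
function generateXPath(node, options = {}) {
  const steps = [];
  for (let current = node; current !== null; current = current.parent) {
    if (current.id) {
      steps.unshift(`//*[@id="${current.id}"]`);
      break;
    }
    if (current.className) {
      // Exact match on the whole class attribute, kept simple on purpose.
      steps.unshift(`//${current.tag}[@class="${current.className}"]`);
      break;
    }
    if (current.parent === null) {
      steps.unshift(`/${current.tag}`); // reached the root without an anchor
      break;
    }
    steps.unshift(
      options.multiple
        ? current.tag
        : `${current.tag}[${positionAmongSameTag(current)}]`
    );
  }
  return steps.join("/");
}

const firstLink = makeNode("a");
const secondLink = makeNode("a");
const list = makeNode("ul", { className: "pics" }, [
  makeNode("li", {}, [firstLink]),
  makeNode("li", {}, [secondLink]),
]);
const container = makeNode("div", { id: "fc_B_pic" }, [list]);

console.log(generateXPath(secondLink)); // //ul[@class="pics"]/li[2]/a[1]
console.log(generateXPath(secondLink, { multiple: true })); // //ul[@class="pics"]/li/a
```

In a real page the same walk would run over DOM elements, and the result could be checked in the browser with `document.evaluate`, which is one possible answer to the third drawback.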


Steal focus from the Chrome omnibox on a new tab

Chrome sets focus to the omnibox when you create a new tab. Although there is an API to replace the new tab page, you cannot simply steal the focus from the omnibox on that page. There are two workarounds.

If you are creating a new tab programmatically

If you are creating a new tab by clicking the new tab button

Chrome Extension Tabs


permissions: [


chrome.tabs.query to get current tab

chrome.tabs.query({active: true, currentWindow: true}, function(tabs) {});  // tabs[0] would be the current tab

create new tab

chrome.tabs.create({url: URL}, function(tab) {})

kill tab

chrome.tabs.remove(tabId or [tabId], function() {})
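The three calls above can be chained. The sketch below is only an illustration: the helper names (`getCurrentTab`, `openTab`, `closeTabs`, `replaceCurrentTab`) are made up here, and the `chrome` object is passed in as a parameter (an assumption of this example) so the helpers can be exercised with a stub outside an extension.

```javascript
// Look up the active tab of the current window.
function getCurrentTab(chromeApi, callback) {
  chromeApi.tabs.query({ active: true, currentWindow: true }, (tabs) => {
    callback(tabs[0]);
  });
}

// Open a new tab at the given URL.
function openTab(chromeApi, url, callback) {
  chromeApi.tabs.create({ url }, callback);
}

// Close one tab (id) or several tabs (array of ids).
function closeTabs(chromeApi, tabIds, callback) {
  chromeApi.tabs.remove(tabIds, callback);
}

// Hypothetical combination: open `url` in a new tab, then close the old tab.
function replaceCurrentTab(chromeApi, url, callback) {
  getCurrentTab(chromeApi, (current) => {
    openTab(chromeApi, url, (created) => {
      closeTabs(chromeApi, current.id, () => callback(created));
    });
  });
}

// Inside an extension the real `chrome` object would be passed in.
if (typeof chrome !== "undefined" && chrome.tabs) {
  replaceCurrentTab(chrome, "https://example.com/", (tab) => {
    console.log("now showing tab", tab.id);
  });
}
```

Passing the API object in keeps the helpers free of globals, which makes them easy to try out with a fake `tabs` object before loading them into a real extension.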

Chrome Extension Storage

Basic Concepts

There are three storage areas in Chrome: sync, local, and managed. The sync area is synced with the cloud; the managed area is read-only.

All of your extension's scripts share the same storage, including content scripts; the data does not belong to the localStorage of the page's domain.

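Usage

StorageArea.get('key', function(data) {});
StorageArea.get(["KEY1", "KEY2"], function(data) {});
StorageArea.set(data, function() {});  // data is key-value pair to store
StorageArea.remove('key', function() {});
StorageArea.remove(["KEY1", "KEY2"], function() {});
StorageArea.clear(function() {});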
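As a rough sketch of how the shared storage might be used, the example below keeps a counter under one key. The storage area is passed in as a parameter (an assumption of this example) so it can be any object with callback-style `get`, `set`, and `remove`, such as `chrome.storage.local` in an extension that declares the `storage` permission. The names `createCounter` and `visitCount` are made up for illustration.

```javascript
// Build a small counter on top of a storage area.
function createCounter(area, key) {
  return {
    // Read the current value (0 if the key is missing), add one, write it back.
    increment(callback) {
      area.get(key, (data) => {
        const next = (data[key] || 0) + 1;
        area.set({ [key]: next }, () => callback(next));
      });
    },
    // Remove the key so the next increment starts from 1 again.
    reset(callback) {
      area.remove(key, callback);
    },
  };
}

// Inside an extension the local area would be passed in.
if (typeof chrome !== "undefined" && chrome.storage) {
  const counter = createCounter(chrome.storage.local, "visitCount");
  counter.increment((value) => console.log("visits so far:", value));
}
```

Because content scripts and the other extension scripts share this storage, a counter incremented from a content script would be visible to the rest of the extension.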


Chrome extension cookies


Set the cookies permission and the domains whose cookies you would like to access.

"permissions": [



just a simple object with {name, value, domain...}


normal mode and incognito mode use different cookie stores.


get: chrome.cookies.get({url: URL, name: COOKIE_NAME, storeId: COOKIE_STORE_ID}, function(cookie) {})

get all: chrome.cookies.getAll({domain: DOMAIN}, function(cookies) {}) NOTE: there are other filters not listed here.

set: chrome.cookies.set({url, name, value}, function(cookie) {}) If it fails, the callback gets null.
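The calls above might be combined as in the following sketch, which copies one cookie's value from one URL to another and lists cookie names for a domain. The helper names `copyCookie` and `listCookieNames`, the example URLs, and the cookie name `session` are hypothetical, and the `chrome` object is passed in as a parameter (an assumption of this example) so the code can be tried with a stub.

```javascript
// Copy the value of one cookie to another URL.
// Calls back with null if the source cookie is missing or the set fails.
function copyCookie(chromeApi, fromUrl, toUrl, name, callback) {
  chromeApi.cookies.get({ url: fromUrl, name }, (cookie) => {
    if (!cookie) {
      callback(null);
      return;
    }
    chromeApi.cookies.set({ url: toUrl, name, value: cookie.value }, callback);
  });
}

// Collect the names of all cookies that match a domain.
function listCookieNames(chromeApi, domain, callback) {
  chromeApi.cookies.getAll({ domain }, (cookies) => {
    callback(cookies.map((cookie) => cookie.name));
  });
}

// Inside an extension with the cookies permission and matching host permissions.
if (typeof chrome !== "undefined" && chrome.cookies) {
  copyCookie(chrome, "https://a.example.com/", "https://b.example.com/", "session", (cookie) => {
    console.log(cookie ? "copied" : "nothing copied");
  });
}
```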

Chrome Extension Development

A Chrome extension can inject scripts into a page; such a script is called a content script.


Add browser_action.default_icon in your manifest.json file


  "browser_action": {
    "default_icon": "icons/icon-32.png"
  }


Learning Greasemonkey: a tutorial

Learning GreaseMonkey/TamperMonkey


  • @name script name
  • @namespace namespace
  • @version version
  • @author author
  • @description
  • @homepage
  • @icon
  • @updateURL
  • @downloadURL
  • @include
  • @exclude
  • @resource key url
  • @require include scripts
  • @connect domains the script may reach with cross-origin requests: self (the current domain), localhost, or *
  • @run-at when to run the script: document-start/document-body/document-end/document-idle/context-menu
  • @grant whitelist GM_* functions. If no @grant tag is given, TM guesses what the script needs.
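To show how several of these tags fit together, here is a small hypothetical userscript. The script name, namespace, URL pattern, and the `cleanTitle` helper are all invented for illustration; only the tag names and `GM_setClipboard` come from the lists in this note.

```javascript
// ==UserScript==
// @name         Copy page title (hypothetical example)
// @namespace    https://example.com/userscripts
// @version      0.1.0
// @description  Copies the cleaned-up page title to the clipboard
// @include      https://example.com/*
// @run-at       document-end
// @grant        GM_setClipboard
// ==/UserScript==

// Pure helper: collapse runs of whitespace and trim the ends.
function cleanTitle(title) {
  return title.replace(/\s+/g, " ").trim();
}

// Only runs inside a userscript manager, where `document` and GM_setClipboard exist.
if (typeof document !== "undefined" && typeof GM_setClipboard === "function") {
  GM_setClipboard(cleanTitle(document.title), "text");
}
```

Declaring `@grant GM_setClipboard` explicitly avoids relying on the manager guessing which GM_* functions the script needs.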


GM_getResourceURL(name) get base64 encoded URI
GM_getTab(cb)   Get an object that is persistent as long as this tab is open.
GM_getTabs(cb)  Get all tab objects as a hash to communicate with other script instances.
GM_setClipboard(data, info) set the clipboard

GM_xmlhttpRequest can make cross-domain requests
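A plain cross-domain GET might look like the sketch below. The wrapper name `fetchText` and the URL are hypothetical, and the request function is passed in as a parameter (an assumption of this example) so the wrapper can be tried with a stub. In a real userscript the target host would also need a matching @connect entry and the script would need `@grant GM_xmlhttpRequest`.

```javascript
// Wrap a GM_xmlhttpRequest-style function in a Node-style callback (err, text).
function fetchText(gmRequest, url, callback) {
  gmRequest({
    method: "GET",
    url,
    onload: (response) => {
      if (response.status >= 200 && response.status < 300) {
        callback(null, response.responseText);
      } else {
        callback(new Error(`HTTP ${response.status}`), null);
      }
    },
    onerror: () => callback(new Error("network error"), null),
  });
}

// Inside a userscript the real function would be passed in.
if (typeof GM_xmlhttpRequest === "function") {
  fetchText(GM_xmlhttpRequest, "https://api.example.com/data", (err, text) => {
    if (err) console.error(err.message);
    else console.log(text.length, "characters received");
  });
}
```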

using it in $.ajax