browser-extension

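How is an XPath generator implemented?

When writing a crawler, the one step that ultimately resists automation is specifying the XPath of the elements to extract. Crawling a specific site almost always comes down to finding the XPath of the next-page link and of the data elements. If XPath generation could be handed to operations colleagues who don't program, it would free up a great deal of programmer time.

XPath is a DSL, after all, and it is still fairly hard for people who don't program. Plenty of PMs write SQL fluently, but an operations colleague who can write XPath is hard to find. Everyone has their own specialty, and the problems operations staff face are quite different from the ones we programmers face. After years of experience, my feeling is that teaching them YAML is about the limit…

So could a graphical tool generate the XPath? Obviously yes: Chrome has a built-in XPath generator, as shown below: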

chrome xpath

The XPath generated in this screenshot is: //*[@id="fc_B_pic"]/ul[1]/li[1]/a[1]

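However, Chrome's XPath generation has several shortcomings:

  1. Chrome only walks upward looking for an element with an id, whereas in practice finding an ancestor with a class is usually enough to get a correct XPath.
  2. Chrome tries to make the generated XPath match exactly one element, so when you want an XPath that selects multiple elements, Chrome cannot help and you still have to rewrite it by hand.
  3. Once the XPath is generated, there is no convenient graphical way to verify it.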
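
The first two shortcomings can be illustrated with a small sketch. It is not Chrome's algorithm and not a finished generator: `buildXPath` and its `multiple` option are hypothetical names, and the function assumes each node exposes `tagName`, `id`, `className`, `parentElement` and `previousElementSibling`, as DOM elements do. It stops at the nearest ancestor with an id or a class, and in `multiple` mode it drops the positional indexes so that sibling items match as well.

```javascript
// Hypothetical sketch: build an XPath by walking up from `el` and anchoring
// at the first node (including `el` itself) that has an id or a class.
function buildXPath(el, options) {
  const multiple = Boolean(options && options.multiple);
  const steps = [];
  for (let node = el; node; node = node.parentElement) {
    const tag = node.tagName.toLowerCase();
    if (node.id) {
      // An id is the strongest anchor; stop here.
      steps.unshift('//' + tag + '[@id="' + node.id + '"]');
      return steps.join("/");
    }
    // className is a plain string on HTML elements (not on SVG elements).
    const classAttr = typeof node.className === "string" ? node.className : "";
    const cls = classAttr.trim().split(/\s+/)[0];
    if (cls) {
      // Match the first class as a whole token inside the class attribute.
      steps.unshift('//' + tag +
        '[contains(concat(" ", normalize-space(@class), " "), " ' + cls + ' ")]');
      return steps.join("/");
    }
    // No anchor on this node: count preceding siblings with the same tag.
    let index = 1;
    for (let sib = node.previousElementSibling; sib; sib = sib.previousElementSibling) {
      if (sib.tagName === node.tagName) index += 1;
    }
    steps.unshift(multiple ? tag : tag + "[" + index + "]");
  }
  // Reached the root without finding an id or a class.
  return "/" + steps.join("/");
}
```

A class-based anchor is not guaranteed to be unique, so the result still needs checking. In the DevTools console, `$x(path)` returns the matched nodes, which is one way to verify an expression by hand.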

To be continued

steal focus from chrome omnibox on new tab

Chrome puts focus in the omnibox when you create a new tab. Although there is an API to replace the new tab page, a page loaded there cannot simply take focus away from the omnibox. There are two workarounds.

If you are creating the new tab programmatically:

https://stackoverflow.com/questions/42178723/chrome-extension-creating-new-tab-and-taking-focus-to-page

If the new tab is created by clicking the new tab button:

https://stackoverflow.com/questions/17598778/how-to-steal-focus-from-the-omnibox-in-a-chrome-extension-on-the-new-tab-page

Chrome Extension Tabs

permissions

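"permissions": [
    "tabs"
]

The "tabs" permission is only required to read sensitive tab fields such as url and title; query, create, and remove also work without it.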

usage

chrome.tabs.query to get current tab

chrome.tabs.query({active: true, currentWindow: true}, function(tabs) {});  // tabs[0] would be the current tab

create new tab

chrome.tabs.create({url: URL}, function(tab) {})

close a tab

chrome.tabs.remove(tabId, function() {})  // tabId may also be an array of tab ids
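
The three calls above compose naturally. The following is a minimal sketch, assuming the callback-style API shown in these notes; `getCurrentTab` and `replaceCurrentTab` are hypothetical helper names, and the tabs API is passed in as a parameter (normally `chrome.tabs`) so the helpers can also be tried against a stub.

```javascript
// Hypothetical helper: hand the active tab of the current window to `callback`.
function getCurrentTab(tabsApi, callback) {
  tabsApi.query({ active: true, currentWindow: true }, function (tabs) {
    callback(tabs[0]); // undefined when no tab matches
  });
}

// Hypothetical helper: open `url` in a new tab, then close the tab that was
// active before. `callback` receives the newly created tab.
function replaceCurrentTab(tabsApi, url, callback) {
  getCurrentTab(tabsApi, function (oldTab) {
    tabsApi.create({ url: url }, function (newTab) {
      if (!oldTab) {
        callback(newTab); // nothing to close
        return;
      }
      tabsApi.remove(oldTab.id, function () {
        callback(newTab);
      });
    });
  });
}

// Inside an extension this might be called as:
//   replaceCurrentTab(chrome.tabs, "https://example.com/", function (tab) {
//     console.log("now on tab", tab.id);
//   });
```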

Chrome Extension Storage

Basic Concepts

chrome.storage has three storage areas: sync, local, and managed. The sync area is synced with the cloud, and the managed area is read-only.

All of your extension's scripts share the same storage, content scripts included; the data does not live in the localStorage of the page's domain.

Usage

chrome.storage.local.get('key', function(data) {});
chrome.storage.local.get(["KEY1", "KEY2"], function(data) {});

chrome.storage.local.set(data, function() {});  // data is key-value pair to store

chrome.storage.local.remove('key', function() {});
chrome.storage.local.remove(["KEY1", "KEY2"], function() {});
chrome.storage.local.clear(function() {});
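
As a rough illustration of a read-modify-write round trip with these calls, here is a sketch of a counter. `incrementCounter` is a hypothetical helper, and `createMemoryArea` is only an in-memory stand-in with the same callback shape as `chrome.storage.local`, included so the example can run outside the browser; in an extension you would pass `chrome.storage.local` itself.

```javascript
// In-memory stand-in mimicking the callback shape of chrome.storage.local.
// It exists only for trying the sketch outside an extension.
function createMemoryArea() {
  const items = {};
  return {
    get(keys, cb) {
      const out = {};
      for (const k of [].concat(keys)) {
        if (k in items) out[k] = items[k]; // missing keys are simply absent
      }
      cb(out);
    },
    set(data, cb) { Object.assign(items, data); cb(); },
    remove(keys, cb) { for (const k of [].concat(keys)) delete items[k]; cb(); },
    clear(cb) { for (const k of Object.keys(items)) delete items[k]; cb(); },
  };
}

// Hypothetical helper: add 1 to the number stored under `key` (0 if absent)
// and report the new value through `callback`.
function incrementCounter(area, key, callback) {
  area.get(key, function (data) {
    const next = (data[key] || 0) + 1;
    area.set({ [key]: next }, function () {
      callback(next);
    });
  });
}

// Inside an extension:
//   incrementCounter(chrome.storage.local, "visits", function (n) { console.log(n); });
```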

Events
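
For change notifications, chrome.storage exposes an onChanged event. Its listener receives a `changes` object keyed by item name, where each entry may carry `oldValue` and `newValue`, plus the name of the area that changed. A minimal sketch follows; `describeChanges` is a hypothetical helper and the log format is made up for illustration.

```javascript
// Hypothetical helper: turn the `changes` object of a storage.onChanged event
// into readable lines such as: local.theme: "light" -> "dark"
function describeChanges(changes, areaName) {
  return Object.keys(changes).map(function (key) {
    const change = changes[key];
    return areaName + "." + key + ": " +
      JSON.stringify(change.oldValue) + " -> " + JSON.stringify(change.newValue);
  });
}

// Register the listener only when running inside an extension.
if (typeof chrome !== "undefined" && chrome.storage && chrome.storage.onChanged) {
  chrome.storage.onChanged.addListener(function (changes, areaName) {
    describeChanges(changes, areaName).forEach(function (line) {
      console.log(line);
    });
  });
}
```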

Chrome extension cookies

permissions

Set the cookies permission and the host patterns for the domains whose cookies you want to access.

"permissions": [
    "cookies",
    "*://*.example.com/"
]

type

cookie

just a simple object with {name, value, domain...}

CookieStore

normal mode and incognito mode use different cookie stores.

read

get: chrome.cookies.get({url: URL, name: COOKIE_NAME, storeId: COOKIE_STORE_ID}, function(cookie) {})

get all: chrome.cookies.getAll({domain: DOMAIN}, function(cookies) {}) NOTE: there are other filters not listed here.

set: chrome.cookies.set({url, name, value}, function(cookie) {}) If setting fails, the callback receives null.
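
One plausible use of getAll is to rebuild a Cookie header for a domain, for example to hand it to a crawler. The sketch below assumes the callback-style API above; `cookieHeaderFor` is a hypothetical helper, and the cookies API is passed in (normally `chrome.cookies`) so it can be exercised with a stub.

```javascript
// Hypothetical helper: collect every cookie for `domain` and format them as
// the value of a Cookie request header ("a=1; b=2").
function cookieHeaderFor(cookiesApi, domain, callback) {
  cookiesApi.getAll({ domain: domain }, function (cookies) {
    const header = cookies
      .map(function (c) { return c.name + "=" + c.value; })
      .join("; ");
    callback(header);
  });
}

// Inside an extension that holds the cookies permission for example.com:
//   cookieHeaderFor(chrome.cookies, "example.com", function (header) {
//     console.log(header);
//   });
```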

Chrome Extension Development

A Chrome extension can inject scripts into a page; these are called content scripts.

https://developer.chrome.com/extensions/getstarted
https://developer.chrome.com/extensions/content_scripts
https://developer.chrome.com/extensions/messaging
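
To make the content script and messaging links a little more concrete, here is a rough sketch of a content script reporting the page title to the background page. The two halves would live in separate files (named background.js and content.js here only for illustration), the message shape `{type: "PAGE_TITLE", title}` is invented for this example, and `handleMessage` and `reportTitle` are hypothetical names. It relies on chrome.runtime.sendMessage and chrome.runtime.onMessage.

```javascript
// --- background.js (illustrative file name) ---
// Reply to messages sent by content scripts.
function handleMessage(message, sender, sendResponse) {
  if (message && message.type === "PAGE_TITLE") {
    sendResponse({ ok: true, length: message.title.length });
  } else {
    sendResponse({ ok: false });
  }
}

// Register the handler only when running inside an extension.
if (typeof chrome !== "undefined" && chrome.runtime && chrome.runtime.onMessage) {
  chrome.runtime.onMessage.addListener(handleMessage);
}

// --- content.js (illustrative file name) ---
// Runs inside the page; `runtimeApi` is normally chrome.runtime.
function reportTitle(runtimeApi, title, callback) {
  runtimeApi.sendMessage({ type: "PAGE_TITLE", title: title }, callback);
}

// In the content script itself:
//   reportTitle(chrome.runtime, document.title, function (reply) {
//     console.log(reply);
//   });
```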

The grayed-out icon problem

Add browser_action.default_icon to your manifest.json file

{
  ...

  "browser_action": {
    "default_icon": "icons/icon-32.png"
  },

  ...
}

Learning Greasemonkey: a tutorial

Learning Greasemonkey/Tampermonkey

Header directives

  • @name: script name
  • @namespace: namespace
  • @version: version
  • @author: author
  • @description
  • @homepage
  • @icon
  • @updateURL
  • @downloadURL
  • @include
  • @exclude
  • @resource key url
  • @require: scripts to include
  • @connect: domains reachable by cross-origin requests: self, current domain, localhost, or *
  • @run-at: when to run the script: document-start/document-body/document-end/document-idle/context-menu
  • @grant: whitelist of GM_* functions. If no @grant tag is given, TM guesses what the script needs.
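
Put together, a header using a few of these directives might look like the sketch below. The script name, namespace, and URL pattern are placeholders, and `buildLinkCss` is a hypothetical helper; the only GM_* function used is GM_addStyle, which is why it is the only one listed under @grant.

```javascript
// ==UserScript==
// @name         Highlight links (example)
// @namespace    https://example.com/userscripts
// @version      0.1
// @description  Outline every link on the page
// @include      https://example.com/*
// @run-at       document-end
// @grant        GM_addStyle
// ==/UserScript==

// Hypothetical helper: build the CSS rule. Kept separate from the GM_* call
// so it can be checked on its own.
function buildLinkCss(color) {
  return "a { outline: 2px solid " + color + "; }";
}

// GM_addStyle only exists inside a userscript manager, so guard the call.
if (typeof GM_addStyle === "function") {
  GM_addStyle(buildLinkCss("orange"));
}
```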

Functions

GM_addStyle(css)
GM_getValue / GM_setValue / GM_deleteValue
GM_listValues()
GM_getResourceText(name)
GM_getResourceURL(name): get the resource as a base64-encoded URI
GM_openInTab(url)
GM_getTab(cb): get an object that is persistent as long as this tab is open
GM_getTabs(cb): get all tab objects as a hash, to communicate with other script instances
GM_setClipboard(data, info): set the clipboard


GM_xmlhttpRequest can make cross-domain requests.

Using it with $.ajax: https://gist.github.com/yifeikong/9e93cc38297cce989ffbef5587ad2f39
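
Independently of the $.ajax approach in the gist, a small Promise wrapper is another way to use GM_xmlhttpRequest. This is only a sketch: `gmGet` is a hypothetical name, the second parameter exists so a stub can be supplied outside a userscript manager, and the script would still need `@grant GM_xmlhttpRequest` (and a matching `@connect` entry in Tampermonkey).

```javascript
// Hypothetical wrapper: perform a GET with GM_xmlhttpRequest and resolve with
// the response body. `request` defaults to the global GM_xmlhttpRequest.
function gmGet(url, request) {
  const send = request || GM_xmlhttpRequest;
  return new Promise(function (resolve, reject) {
    send({
      method: "GET",
      url: url,
      onload: function (response) {
        resolve(response.responseText);
      },
      onerror: reject,
    });
  });
}

// Inside a userscript:
//   gmGet("https://example.com/data.json").then(function (text) {
//     console.log(JSON.parse(text));
//   });
```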