Month: August 2017

How is an xpath generator implemented?

When writing crawlers, the one step that ultimately resists automation is specifying the xpath of the elements to extract; targeted crawling of a site almost always comes down to finding the xpaths for the next-page link and the data elements. If generating xpaths could be handed off to operations colleagues who don't write code, it would free up an enormous amount of engineering time.

After all, xpath is itself a DSL, which is a real hurdle for non-programmers. PMs who write fluent SQL are everywhere, but finding an operations person who can write xpath is hard; people specialize, and the problems operations faces are quite different from ours programmers'. In my experience, teaching them yaml is about the limit…

So, can a graphical tool generate xpaths? Obviously yes: the Chrome browser has a built-in tool for generating xpaths, as shown below:

chrome xpath

The xpath generated in this figure is: //*[@id="fc_B_pic"]/ul[1]/li[1]/a[1]

However, Chrome's xpath generation has several drawbacks:

  1. Chrome only walks up the tree looking for an element with an id, but in practice, finding an element with a class is often enough to guarantee the xpath is correct.
  2. Chrome tries to generate an xpath that matches exactly one element; when you want an xpath that selects multiple elements, Chrome can't help, and you have to rewrite it yourself.
  3. After generation there is no convenient graphical way to verify the xpath.
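As a sketch of how such a generator might work (this is not Chrome's actual algorithm): climb from the target element toward the root, stop at the first ancestor carrying an id or class, and emit positional steps for everything below it. Using only the standard library:

```python
import xml.etree.ElementTree as ET

def build_xpath(root, target):
    """Generate an xpath for target, anchored at the nearest id/class ancestor."""
    parent = {c: p for p in root.iter() for c in p}  # child -> parent map
    steps = []
    el = target
    while el is not None:
        if el.get('id'):
            # anchor at the first ancestor with an id, like Chrome does
            steps.append('*[@id="%s"]' % el.get('id'))
            return '//' + '/'.join(reversed(steps))
        if el.get('class'):
            # unlike Chrome, also accept a class as a good-enough anchor
            steps.append('%s[@class="%s"]' % (el.tag, el.get('class')))
            return '//' + '/'.join(reversed(steps))
        p = parent.get(el)
        if p is not None:
            # fall back to a positional step among same-tag siblings
            same = [c for c in p if c.tag == el.tag]
            steps.append('%s[%d]' % (el.tag, same.index(el) + 1))
        else:
            steps.append(el.tag)
        el = p
    return '/' + '/'.join(reversed(steps))

root = ET.fromstring('<div id="fc_B_pic"><ul><li><a>x</a></li></ul></div>')
a = root.find('.//a')
print(build_xpath(root, a))  # //*[@id="fc_B_pic"]/ul[1]/li[1]/a[1]
```

Dropping the early return on `id` and keeping only class anchors is what lets the same code emit xpaths that match multiple elements.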

To be continued

Thoughts on Toutiao

Reading Zhang Yiming's Weibo

Suppose a subordinate offers to resign: if your reaction is "what a pity" or "I really care about this", you should be thinking right now about how to give them more reward and room to grow; conversely, if your reaction is "fine", "even better", or relief, you should be considering a transfer or dismissal.

Reading old news from time to time reminds you how unreliable the media is.

The greatest sword is blunt; the greatest skill looks artless.

When crawling and extracting information from the whole web, the big problem is spam and its more sophisticated forms, paid posts and planted articles. PageRank is a vote by web pages; SNS uses identifiable human behavior to vote. Is there any good method beyond SNS platforms?

Take the long view of the scenery. What's past cannot be redeemed; what's to come can still be pursued.

He is probably, like Zhang Xiaolong, an excellent product manager with an engineering background.

Reflections

After Alibaba's mooncake incident, Yiming reportedly said something like: "We won't hand out mooncakes at Mid-Autumn Festival, because excellent engineers don't care about a box of mooncakes." That remark is actually quite foolish. An excellent employee may well think: "How much does a box of mooncakes cost? If the company begrudges even that, will it be willing to spend on employees in anything else?" Mooncakes are obviously an extremely high-ROI employee benefit, so don't say the opposite just to look different from Alibaba. To be fair, Toutiao's benefits are actually good: there are mooncakes, and gifts at Chinese New Year too.

Google's values are short enough that everyone knows them: Don't be evil. Baidu's are short too: simple and reliable. Neither Google nor Baidu has lived up to its values particularly well, but at least they work as messaging. Toutiao's values are far too long, five or six separate phrases, and to this day I can't recite them. The average employee stays at Toutiao for less than two years; asking people to memorize such a long string of words — it's not as if it were the Core Socialist Values.

Teach Yourself Programming in Ten Years

Get interested in programming, and do some because it is fun. Make sure that it keeps being fun all the way through, so that you will be willing to devote ten years to it.

Talk with other programmers; read other people's code. This is more important than any book or any training course.

Remember that there is a "computer" in "computer science". Know how long it takes your computer to execute an instruction, fetch a word from memory (with and without a cache hit), read consecutive words from disk, and seek to a new location on disk.

Fred Brooks (author of The Mythical Man-Month), in his essay No Silver Bullet, identified a three-part plan for finding great software designers:

  1. Systematically identify top designers as early as possible.
  2. Assign a career mentor to promising designers and carefully plan their careers.
  3. Provide opportunities for growing designers to interact with and stimulate each other.

readability.js 源码阅读

Readability is able to fetch paginated pages and combine them into one page.

Function by function:

init

  1. start the whole process
  2. remove event listeners
  3. remove scripts
  4. find next page
  5. prep the document
  6. get readability article components
  7. get document directions
  8. add readability dom to the dom
  9. post process
  10. scroll to top
  11. append next page to the dom
  12. add some smooth scrolling function

parsedPages

stores the parsed pages; keys are the urls with trailing slashes stripped

prepDocument

  1. create document.body if it does not exist
  2. find the biggest frame (width + height)
  3. remove all css
  4. remove all style elements
  5. replace <br/><br/> with </p><p>

prepArticle

  1. prepare the article node for display: clean out any inline styles and iframes, strip forms and extraneous <p> tags, etc.
  2. clean styles
  3. clean unwanted tags
  4. if there is only one h2, it must be the title; we already have a title, so remove it
  5. remove empty <p>s

getArticleTitle

  1. get document.title or h1
  2. normalize the title

killBreaks

collapse any run of <br/> tags (and surrounding whitespace) into a single <br />

cleanTags

  1. clean child tags of the given element
  2. cleanConditionally

getLinkDensity

the amount of text that is inside a link divided by the total text in the node: anchored text length / total text length
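getLinkDensity is easy to restate in python. A sketch over plain strings rather than DOM nodes:

```python
import re

def get_link_density(node_text, link_texts):
    """Length of text inside <a> tags divided by total text length in the node."""
    total = len(re.sub(r'\s+', ' ', node_text).strip())
    linked = sum(len(t.strip()) for t in link_texts)
    return linked / total if total else 0.0

# "here" is the only anchored text in this 26-character node
print(get_link_density('Read more in the docs here', ['here']))  # ≈ 0.15
```

A node that is mostly anchored text (high density) is more likely navigation than content, which is why readability uses this as a negative signal.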

grabArticle

main logic for readability: using a variety of metrics (content score, class name, element types), find the content that is most likely to be the stuff a user wants to read, then return it wrapped in a div

get nodes to score:

  1. get all nodes
  2. remove unlikely candidates by finding specific patterns in class names and ids
  3. add p, td, and pre nodes to nodesToScore
  4. turn all divs that don't have block-level child elements into p's, and add them to nodesToScore

get candidates

  1. loop through all paragraphs and assign each a score based on how content-y it looks; then add its score to its parent node
  2. skip nodes with fewer than 25 characters
  3. initialize the parent and grandparent nodes
  4. compute the content score:

    1. base score of 1
    2. add a point per comma (note: only English commas are counted)
    3. add another point for every 100 characters in the paragraph, up to 3 points
    4. add the score to the parent; the grandparent gets half
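The scoring steps above can be sketched in python (the helper names are mine, not readability.js's):

```python
def content_score(paragraph_text):
    """Score a paragraph the way readability.js does (sketch)."""
    score = 1                                    # base score
    score += paragraph_text.count(',')           # one point per (English) comma
    score += min(len(paragraph_text) // 100, 3)  # a point per 100 chars, capped at 3
    return score

def propagate(scores, parent, grandparent, score):
    """The parent gets the full score; the grandparent gets half."""
    scores[parent] = scores.get(parent, 0) + score
    scores[grandparent] = scores.get(grandparent, 0) + score / 2
    return scores

# 2 commas + 265 characters -> 1 + 2 + 2 = 5
s = content_score('one, two, three' + 'x' * 250)
print(s)  # 5
```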
       

getCharCount

get the number of times a string s appears in the node e

htmlspecialchars

replace < > & " ' with safe entity strings

flagIsActive / addFlag / removeFlag

check / set / clear readability flags

removeScripts

remove all javascript found on the page

getInnerText

trim and squeeze whitespace, then return the textContent of a node

cleanStyles

clean the style attribute recursively

fixImageFloats

Some content ends up looking ugly if a floated image is too large. If the image is wider than a threshold (currently 55%), stop floating it and center it instead.

postProcessContent

post-processing: add footnotes, fix floating images

getArticleTools

  1. get document.title or h1
  2. normalize the title

getSuggestedDirection

getArticleFooter

  1. readability tracking script

addFootnotes

add links found in the page as footnotes

useRdbTypekit

nothing

xhr / successfulRequest / ajax

xmlhttprequest helper functions

findBaseUrl

find the article's base url: normalized, with the pagination part removed; only the path part, no query string

findNextPage

  1. find all links
  2. skip a link if it has already been seen or points to the page itself
  3. if it is on a different domain, ignore it
  4. if it matches the EXTRANEOUS regex or has long link text, remove it
  5. strip the base url from it; if no number remains, remove it

The logic is good as-is; it just needs translating to python.
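A python translation might look like this (the EXTRANEOUS pattern here is an abbreviated stand-in for the real regex in readability.js):

```python
import re
from urllib.parse import urlparse

# abbreviated stand-in for readability.js's EXTRANEOUS pattern
EXTRANEOUS = re.compile(r'print|archive|comment|discuss|e-?mail|share|reply'
                        r'|login|sign|single', re.I)

def find_next_page(base_url, links, seen):
    """links: list of (href, link_text). Returns candidate next-page URLs."""
    base_host = urlparse(base_url).netloc
    candidates = []
    for href, text in links:
        if href in seen or href == base_url:
            continue                      # already seen, or the page itself
        if urlparse(href).netloc != base_host:
            continue                      # different domain: ignore
        if EXTRANEOUS.search(text) or len(text) > 25:
            continue                      # looks like a nav/share/print link
        tail = href.replace(base_url, '')
        if not re.search(r'\d', tail):
            continue                      # no page number left after stripping base
        candidates.append(href)
    return candidates

links = [('http://example.com/article?page=2', 'Next'),
         ('http://other.com/x', 'Next'),
         ('http://example.com/print', 'Print')]
print(find_next_page('http://example.com/article', links, set()))
# ['http://example.com/article?page=2']
```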
 

appendNextPage

A sensible view of performance reviews

Yifei's thoughts

The yardstick used for performance reviews must be long. Here is why:

In any company, new ideas come up every day, and not every one of them gets implemented: some people have no time, some ideas are pure flights of fancy, some are impossible because of issues elsewhere, and sometimes the proposer has only a rough concept while the implementer is the one who turns it into something real.

If something is important or simply must be done, it should be put on paper and scheduled; that is the only way it actually gets done. But once you are in a management role, resist writing everything down, because a large share of it is the optional stuff mentioned above. Give each person freedom, and they can play to their strengths.

So: use OKRs, not KPIs.

Reading Instagram's Python upgrade article

As Instagram's user count grew rapidly, a performance problem emerged: the server count was starting to grow faster than the user count.

So they decided to skip Python 2's clumsy async IO implementations (poor gevent, tornado, and twisted) and upgrade straight to Python 3 to explore what the standard library's asyncio module could offer.

At Instagram, the Python 3 migration had to satisfy two preconditions:

  • No downtime: no service may become unavailable because of it
  • No impact on the development of new product features

Dropbox CEO’s speech

Bill Gates’s first company made software for traffic lights. Steve Jobs’s first company made plastic whistles that let you make free phone calls. Both failed, but it’s hard to imagine they were too upset about it. That’s my favourite thing that changes today. You no longer carry around a number indicating the sum of all your mistakes. From now on, failure doesn’t matter: you only have to be right once.

So that’s how 30,000 ended up on the cheat sheet. That night, I realised there are no warmups, no practice rounds, no reset buttons. Every day we’re writing a few more words of a story. And when you die, it’s not like “here lies Drew, he came in 174th place.” So from then on, I stopped trying to make my life perfect, and instead tried to make it interesting. I wanted my story to be an adventure — and that’s made all the difference.

And today on your commencement, your first day of life in the real world, that’s what I wish for you. Instead of trying to make your life perfect, give yourself the freedom to make it an adventure, and go ever upward. Thank you.

It took me a while to get it, but the hardest-working people don’t work hard because they’re disciplined. They work hard because working on an exciting problem is fun. So after today, it’s not about pushing yourself; it’s about finding your tennis ball, the thing that pulls you. It might take a while, but until you find it, keep listening for that little voice.

Fortunately, it doesn’t matter. No one has a 5.0 in real life. In fact, when you finish school, the whole notion of a GPA just goes away. When you’re in school, every little mistake is a permanent crack in your windshield. But in the real world, if you’re not swerving around and hitting the guard rails every now and then, you’re not going fast enough. Your biggest risk isn’t failing, it’s getting too comfortable.

Honestly, I don’t think I’ve ever been “ready.” I remember the day our first investors said yes and asked us where to send the money. For a 24 year old, this is Christmas — and opening your present is hitting refresh over and over on bankofamerica.com and watching your company’s checking account go from 60 dollars to 1.2 million dollars. At first I was ecstatic — that number has two commas in it! I took a screenshot — but then I was sick to my stomach. Someday these guys are going to want this back. What the hell have I gotten myself into?

They say that you’re the average of the 5 people you spend the most time with. Think about that for a minute: who would be in your circle of 5? I have some good news: MIT is one of the best places in the world to start building that circle. If I hadn’t come here, I wouldn’t have met Adam, I wouldn’t have met my amazing cofounder, Arash, and there would be no Dropbox.

And now your circle will grow to include your coworkers and everyone around you. Where you live matters: there’s only one MIT. And there’s only one Hollywood and only one Silicon Valley. This isn’t a coincidence: for whatever you’re doing, there’s usually only one place where the top people go. You should go there. Don’t settle for anywhere else. Meeting my heroes and learning from them gave me a huge advantage. Your heroes are part of your circle too — follow them. If the real action is happening somewhere else, move.

One thing I’ve learned is surrounding yourself with inspiring people is now just as important as being talented or working hard. Can you imagine if Michael Jordan hadn’t been in the NBA, if his circle of 5 had been a bunch of guys in Italy? Your circle pushes you to be better, just as Adam pushed me.

I was thrilled for him, but it was a shock for me. Here was my faithful beer pong partner and my little brother in the fraternity, two years younger than me. I was out of excuses. He was off to the Super Bowl and I wasn’t even getting drafted. He had no idea at the time, but Adam had given me just the kick I needed. It was time for a change.

How a crawler can best mimic a browser

http headers

When sending an http request, always add these seven headers: Host, Connection, Accept, User-Agent, Referer, Accept-Encoding, and Accept-Language, because a normal browser sends all seven.

Specifically:

  1. Host is usually filled in by the http library already
  2. Connection: use Keep-Alive
  3. Accept: usually text/html or application/json
  4. User-Agent: use your own crawler's UA or forge a browser UA
  5. Referer: the current URL is usually fine; consider setting referers following the real visit order, with google as the initial referer
  6. Accept-Encoding: choose from gzip and deflate; many sites force gzip responses anyway
  7. Accept-Language: choose as appropriate, e.g. zh-CN, en-US
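A sketch of a request carrying all seven headers, using only the stdlib (the host and header values are illustrative, not prescriptive):

```python
import urllib.request

# the seven headers a normal browser always sends (values are illustrative)
headers = {
    'Host': 'example.com',
    'Connection': 'keep-alive',
    'Accept': 'text/html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Referer': 'https://www.google.com/',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,en-US;q=0.8',
}

req = urllib.request.Request('http://example.com/', headers=headers)
# resp = urllib.request.urlopen(req)  # uncomment to actually send the request
```

Note that if Accept-Encoding advertises gzip, your crawler must also be prepared to decompress gzip response bodies.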

cookies

cookies need to be kept up to date

others

there may be trap links invisible to human visitors; do not follow them

Adaptive crawl intervals

If a site is already throttling your IP, don't keep retrying blindly; you have to back off for a while. NetEase Cloud Music is simple to handle: just sleep. In general, the sleep interval should grow with the circumstances: sleep 10 seconds first; if you are still being throttled, sleep 20 seconds, and so on. The interval settings this adaptive scheme converges to are empirical values.
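A minimal sketch of the backoff just described, doubling the sleep each time the site keeps throttling (`fetch` is a hypothetical callable that returns True once a request succeeds):

```python
import time

def fetch_with_backoff(fetch, url, base=10, factor=2, max_sleep=300):
    """Retry fetch(url), doubling the sleep while the site keeps throttling us."""
    sleep = base
    while True:
        if fetch(url):                    # fetch() returns True when not throttled
            return sleep                  # report the interval we ended up at
        time.sleep(min(sleep, max_sleep))  # back off, capped at max_sleep
        sleep *= factor

# simulate two throttled attempts followed by a success (tiny base to keep it fast)
attempts = iter([False, False, True])
waited = fetch_with_backoff(lambda url: next(attempts), 'http://example.com', base=0.01)
print(waited)  # 0.04 after two throttled attempts
```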

ref

  1. http://www.cnblogs.com/jexus/p/5471665.html

General-purpose crawler system design

Broadly, a crawler handles two kinds of tasks. One is routine large-scale crawling, periodically crawling certain sites or the whole web; the other is ad-hoc extraction of data from one specific class of pages on one site. The boundary between the two is not sharp, and the latter can often just extract its data from pages the former has already fetched. Also, for the latter, a page fetched once never needs refetching, while a search engine must additionally decide which links need to be fetched over and over.

Competitor monitoring: for competitors, track their data; for potential acquisition targets, check whether their data is genuine. Data visualization matters a great deal here. [4]

  • Whole-site crawling: use a graph traversal algorithm
  • Update crawling: find the list pages and keep refreshing them to pick up updates

  • hub and detail should not be treated as two strictly separate page types, but as two attributes every page has

  • Links on a page come in two kinds: button and anchor. A button stays within the same page and the window does not go away; an anchor loads a new page
  • Does a page yield a list or a single record? How do lists of values get recombined into a list of objects? If the counts don't line up, the correspondence is lost, which is tricky
  • Using the url as the primary key is also problematic: the url may never change while the page content keeps updating

Metrics to watch

  • page success rate (200 OK)
  • page download time
  • page size
  • html parse success rate
  • crawl rate, i.e. the rate of new links
  • proportion of already-seen links

Evaluation metrics: coverage, freshness, dead-link rate

Improving crawl efficiency

  1. running your own dns is a good way to gain speed
  2. use a [bloom filter][1]
  3. if possible, use google, bing, baidu, or archive.org
    1. to discover new links on a site
    2. to fetch meta information
    3. to crawl google's cache directly
  4. an auto-throttle algorithm
  5. use your own users as exit nodes
  6. use every web service you control that has outbound internet access as a crawl node
  7. reverse-engineer [site templates][3]
  8. track how many IPs sit behind each domain
  9. discard pages that are too large; a 2M cap on page size is enough
  10. rate-limit per site: keep per-domain access-frequency statistics
  11. give every crawl machine its own DNS cache
  12. aggregate pages into large files for storage instead of one file per page, to cut disk seek time; basically GFS
  13. always enable gzip; it massively reduces the data transferred
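On point 2: for deduplicating seen URLs at scale, a Bloom filter keeps memory bounded at the cost of a small false-positive rate. A minimal stdlib sketch (sizes and hash counts are illustrative):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for seen-URL deduplication (sketch)."""

    def __init__(self, size=2 ** 20, hashes=4):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size // 8)  # bit array, 1 Mbit by default

    def _positions(self, item):
        # derive k bit positions from slices of one sha256 digest
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.hashes):
            yield int.from_bytes(digest[i * 4:i * 4 + 4], 'big') % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # all k bits set -> "probably seen"; any bit clear -> "definitely not seen"
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
seen.add('http://example.com/a')
print('http://example.com/a' in seen)  # True
print('http://example.com/b' in seen)  # False (with high probability)
```

A "no" answer is always exact, so a crawler never re-fetches a page by mistake; a rare false "yes" merely skips an unseen URL.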

Revisit policy

Web pages are always being updated. In large-scale crawling, an important problem is the revisit policy: when to fetch the same page again to pick up its updates. A solution needs to satisfy two conditions:

  1. visit as rarely as possible, to reduce the resource load on both ourselves and the target site
  2. pick up updates as quickly as possible, to get fresh results

These two conditions are almost directly opposed, so we need an algorithm that strikes as good a compromise as possible.

A Poisson process can be used: https://stackoverflow.com/questions/10331738/strategy-for-how-to-crawl-index-frequently-updated-webpages
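One simple way to use the Poisson model: treat observed change times as samples of a Poisson process, estimate the rate from the mean gap between changes, and draw the next revisit delay from the corresponding exponential distribution. A sketch (the one-day fallback for unseen pages is my assumption):

```python
import random

def next_revisit_delay(change_timestamps, rng=random.Random(42)):
    """Fit a Poisson rate to observed change times and sample the next
    revisit delay (in seconds) from the fitted exponential distribution."""
    if len(change_timestamps) < 2:
        return 86400.0                       # fallback: come back in a day
    gaps = [b - a for a, b in zip(change_timestamps, change_timestamps[1:])]
    lam = 1.0 / (sum(gaps) / len(gaps))      # Poisson rate = 1 / mean gap
    return rng.expovariate(lam)              # inter-change times are exponential

# a page observed to change roughly once an hour
delay = next_revisit_delay([0, 3600, 7300, 10700])
print(round(delay))
```

Pages that change often get short sampled delays; stale pages drift toward long ones, which is exactly the compromise the two conditions above demand.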

References

 
1. http://www.cnblogs.com/coser/archive/2012/03/16/2402389.html
2. http://www.zhihu.com/question/24326030/answer/71813450
3. https://www.zhihu.com/question/27621722
4. https://en.wikipedia.org/wiki/Web_crawler
5. https://intoli.com/blog/aopic-algorithm/

IPython main features

Print your whole history to a file: %history -g -f filename

  • use ? to get quick help; ?? provides additional detail.

  • Searching through modules and namespaces with * wildcards, both when using the ? system and via the %psearch command.

  • magic commands, % or %% prefixed commands

  • Alias facility for defining your own system aliases.

  • Lines starting with ! are passed directly to the system shell, and using !! or var = !cmd captures shell output into python variables for further use.

  • python variable prefixed with $ is expanded. A double $$ allows passing a literal $ to the shell (for access to shell and environment variables like PATH).

  • Filesystem navigation, via a magic %cd command, along with a persistent bookmark system (using %bookmark) for fast access to frequently visited directories.

  • A lightweight persistence framework via the %store command, which allows you to save arbitrary Python variables. These get restored when you run the %store -r command.

  • Automatic indentation and highlighting of code as you type (through the prompt_toolkit library).

  • %macro command. Macros can be stored persistently via %store and edited via %edit.

  • Session logging (you can then later use these logs as code in your programs). Logs can optionally timestamp all input, and also store session output (marked as comments, so the log remains valid Python source code).
    Session restoring: logs can be replayed to restore a previous session to the state where you left it.

  • Auto-parentheses via the %autocall command: callable objects can be executed without parentheses: sin 3 is automatically converted to sin(3)

  • Auto-quoting: using , or ; as the first character forces auto-quoting of the rest of the line: ,my_function a b automatically becomes my_function("a","b"), while ;my_function a b becomes my_function("a b").

  • Extensible input syntax. You can define filters that pre-process user input to simplify input in special situations. This allows for example pasting multi-line code fragments which start with >>> or … such as those from other python sessions or the standard Python documentation.

  • Flexible configuration system. It uses a configuration file which allows permanent setting of all command-line options, module loading, code and file execution. The system allows recursive file inclusion, so you can have a base file with defaults and layers which load other customizations for particular projects.

  • Embeddable. You can call IPython as a python shell inside your own python programs. This can be used both for debugging code or for providing interactive abilities to your programs with knowledge about the local namespaces (very useful in debugging and data analysis situations).

  • Easy debugger access. You can set IPython to call up an enhanced version of the Python debugger (pdb) every time there is an uncaught exception. This drops you inside the code which triggered the exception with all the data live and it is possible to navigate the stack to rapidly isolate the source of a bug. The %run magic command (with the -d option) can run any script under pdb’s control, automatically setting initial breakpoints for you. This version of pdb has IPython-specific improvements, including tab-completion and traceback coloring support. For even easier debugger access, try %debug after seeing an exception.

  • Profiler support. You can run single statements (similar to profile.run()) or complete programs under the profiler’s control. While this is possible with standard cProfile or profile modules, IPython wraps this functionality with magic commands (see %prun and %run -p) convenient for rapid interactive work.

  • Simple timing information. You can use the %timeit command to get the execution time of a Python statement or expression. This machinery is intelligent enough to do more repetitions for commands that finish very quickly in order to get a better estimate of their running time.

In [1]: %timeit 1+1
10000000 loops, best of 3: 25.5 ns per loop

In [2]: %timeit [math.sin(x) for x in range(5000)]
1000 loops, best of 3: 719 µs per loop

To get the timing information for more than one expression, use the %%timeit cell magic command.

  • Doctest support. The special %doctest_mode command toggles a mode to use doctest-compatible prompts, so you can use IPython sessions as doctest code. By default, IPython also allows you to paste existing doctests, and strips out the leading >>> and … prompts in them.