Month: August 2017

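How is an XPath generator implemented?

When writing crawlers, the one thing that ultimately resists automation is specifying the XPath of the elements to extract. Crawling a specific site almost always comes down to finding XPaths for the next-page link and the data elements. If XPath generation could be handed to operations colleagues who do not program, it would free up a great deal of programmer productivity.

After all, XPath is a DSL, and it is reasonably hard for people who do not program. There are plenty of PMs fluent in SQL, but an operations colleague who can write XPath is hard to find. Everyone has their own specialty, and the problems operations staff face are quite different from those we programmers face. In my years of experience, teaching them YAML is about the limit…

So can a graphical tool generate XPath? The answer is obviously yes: Chrome has a built-in XPath generator, as shown in the figure below: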

chrome xpath

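The XPath generated in this figure is: //*[@id="fc_B_pic"]/ul[1]/li[1]/a[1]

However, Chrome's XPath generation has several drawbacks:

  1. Chrome only walks upward looking for an element with an id, whereas in practice reaching an element with a class is often enough to produce a correct XPath.
  2. Chrome tries to make the generated XPath match a unique element, so when you want an XPath that selects multiple elements, Chrome cannot help and you still have to rewrite it yourself.
  3. Also, there is no convenient graphical way to verify the XPath after it is generated.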
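A minimal sketch of a generator that addresses the first two drawbacks, assuming well-formed markup that the standard library's xml.etree.ElementTree can parse. The function name build_xpath, the unique flag, and the sample page are hypothetical, and class matching here is a plain attribute comparison. It stops at the nearest ancestor that has an id or a class, and it can drop positional indexes so that the result selects every sibling.

```python
import xml.etree.ElementTree as ET


def build_xpath(root, target, unique=True):
    """Build an XPath for target, anchored at the nearest id/class ancestor.

    unique=True keeps positional indexes (one element is selected);
    unique=False drops them so that similar siblings are selected too.
    The anchor must be a descendant of root, because './/' skips root itself.
    """
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        for attr in ("id", "class"):
            value = node.get(attr)
            if value:
                # Found an anchor: stop climbing here.
                steps.append(f"{node.tag}[@{attr}='{value}']")
                return ".//" + "/".join(reversed(steps))
        parent = parents.get(node)
        if parent is None:
            break  # reached the root without finding an anchor
        if unique:
            same_tag = [c for c in parent if c.tag == node.tag]
            steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        else:
            steps.append(node.tag)
        node = parent
    return "./" + "/".join(reversed(steps)) if steps else "."


PAGE = """
<html><body>
  <div id="fc_B_pic">
    <ul class="pics">
      <li><a href="/a">A</a></li>
      <li><a href="/b">B</a></li>
    </ul>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)
first_link = root.find(".//a")
single = build_xpath(root, first_link, unique=True)
many = build_xpath(root, first_link, unique=False)
# Verification step: evaluate the generated paths against the same document.
print(single, [a.get("href") for a in root.findall(single)])
print(many, [a.get("href") for a in root.findall(many)])
```

Evaluating the generated path against the page, as the last lines do, is a crude stand-in for the graphical verification mentioned in the third point.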

To be continued

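Thoughts on Toutiao

Reading Zhang Yiming's Weibo

Suppose a subordinate resigns. If your reaction is "what a pity" or "I really care", you should think right now about how to give them more rewards and room to grow. Conversely, if your reaction is "fine", "even better", or even relief, you should consider whether to reassign or dismiss them.

Read old news from time to time and you will see how unreliable the media is.

A great sword has no edge; great skill needs no ornament.

When crawling, analyzing, and extracting information across the whole web, the big problem is spam and its advanced forms, advertorials and ghostwritten pieces. PageRank is voting by web pages; SNS uses the behavior of identifiable people to vote. Are there good methods outside SNS platforms?

Take the long view of the landscape. The past cannot be changed, but the future can still be pursued.

He seems to be, like Zhang Xiaolong, an excellent product manager with a technical background.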

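Reflections

After Alibaba's mooncake incident, Yiming apparently said: "We don't hand out mooncakes for Mid-Autumn Festival, because excellent engineers don't care about a box of mooncakes." That remark is actually quite foolish. An excellent employee might think: "How much does a box of mooncakes cost? If the company won't even spend that, will it be willing to invest in employees in other ways?" Mooncakes are clearly an employee benefit with a very high ROI, so don't say the opposite just to look different from Alibaba. Of course, Toutiao's benefits are in fact very good: there are mooncakes, and there are gifts at Chinese New Year too.

Google's statement of values is short: Don't be evil. So short that everyone knows it. Baidu's is short too: simple and dependable (简单可依赖). Neither Google nor Baidu has lived up to its values very well, but at least they work as publicity. Toutiao's values are too long, five or six separate words, and I still cannot recite them. The average Toutiao employee probably stays less than two years, and you cannot expect people to memorize such a long string. It is not as if these were the Core Socialist Values.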

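Teach Yourself Programming in Ten Years

Get interested in programming and write programs because it is fun. Make sure you keep enjoying it from start to finish, so that you are willing to invest ten years in programming.

Talk with other programmers; read other people's code. This is more important than any book or training course.

Remember that there is a "computer" in "computer science". Know how long it takes your computer to execute an instruction, to fetch a word from memory (with and without a cache miss), to read consecutive words from disk, and to seek to a new location on disk.

Fred Brooks (author of The Mythical Man-Month), in his essay No Silver Bullet, lays out a three-part plan for finding great software designers:

  1. Systematically identify top designers as early as possible.
  2. Assign a career mentor to each promising person and plan their career carefully.
  3. Give growing programmers opportunities to influence and stimulate each other.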

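左耳朵耗子's view of performance management

The purpose of setting goals and performance measures is not to evaluate people, but to improve the results and efficiency of the organization and its people.

People are complex and their state fluctuates. You should never lightly write a person off; performance reviews should assess the work, not the person.

The biggest problem with evaluating values is that it very easily escalates into matters of principle, is very easily turned into political infighting, and very easily stifles differing ideas. Frankly, to a large extent it is a means of brainwashing: controlling thought by creating tension or fear in people.

KPIs suit industries that use people as machines, while OKRs suit innovative industries where everyone is a member of the company.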

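Yifei's thoughts

The yardstick used for evaluation must be long, for these reasons:

In a company, many new ideas come up every day, and not every one is carried out: some people have no time to do theirs, some ideas are pure flights of fancy, some cannot be realized because of other unknown problems, and sometimes the proposer only had a rough idea while the implementer turned it into something remarkable.

If something is important or must be done, put it in writing and give it a schedule; only then will it really get done. Of course, once you are in a management position, avoid writing everything down, because a good share of it is the take-it-or-leave-it kind mentioned above. Give people freedom so that they can play to their strengths.

So use OKRs, not KPIs.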

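Reading Instagram's article on upgrading Python

As Instagram's user base grew rapidly, performance problems appeared anyway: the growth rate of the server count was gradually overtaking the growth rate of users.

So they decided to skip the clumsy async IO implementations in Python 2 (poor gevent, tornado, twisted, and the rest), upgrade straight to Python 3, and explore what the standard library's asyncio module could offer.

At Instagram, the Python 3 migration had to meet two preconditions:

  • No downtime: no service may become unavailable because of it
  • No impact on the development of new product features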

Dropbox CEO’s speech

Bill Gates’s first company made software for traffic lights. Steve Jobs’s first company made plastic whistles that let you make free phone calls. Both failed, but it’s hard to imagine they were too upset about it. That’s my favourite thing that changes today. You no longer carry around a number indicating the sum of all your mistakes. From now on, failure doesn’t matter: you only have to be right once.

So that’s how 30,000 ended up on the cheat sheet. That night, I realised there are no warmups, no practice rounds, no reset buttons. Every day we’re writing a few more words of a story. And when you die, it’s not like “here lies Drew, he came in 174th place.” So from then on, I stopped trying to make my life perfect, and instead tried to make it interesting. I wanted my story to be an adventure — and that’s made all the difference.

And today on your commencement, your first day of life in the real world, that’s what I wish for you. Instead of trying to make your life perfect, give yourself the freedom to make it an adventure, and go ever upward. Thank you.

It took me a while to get it, but the hardest-working people don’t work hard because they’re disciplined. They work hard because working on an exciting problem is fun. So after today, it’s not about pushing yourself; it’s about finding your tennis ball, the thing that pulls you. It might take a while, but until you find it, keep listening for that little voice.

Fortunately, it doesn’t matter. No one has a 5.0 in real life. In fact, when you finish school, the whole notion of a GPA just goes away. When you’re in school, every little mistake is a permanent crack in your windshield. But in the real world, if you’re not swerving around and hitting the guard rails every now and then, you’re not going fast enough. Your biggest risk isn’t failing, it’s getting too comfortable.

Honestly, I don’t think I’ve ever been “ready.” I remember the day our first investors said yes and asked us where to send the money. For a 24 year old, this is Christmas — and opening your present is hitting refresh over and over on bankofamerica.com and watching your company’s checking account go from 60 dollars to 1.2 million dollars. At first I was ecstatic — that number has two commas in it! I took a screenshot — but then I was sick to my stomach. Someday these guys are going to want this back. What the hell have I gotten myself into?

They say that you’re the average of the 5 people you spend the most time with. Think about that for a minute: who would be in your circle of 5? I have some good news: MIT is one of the best places in the world to start building that circle. If I hadn’t come here, I wouldn’t have met Adam, I wouldn’t have met my amazing cofounder, Arash, and there would be no Dropbox.

And now your circle will grow to include your coworkers and everyone around you. Where you live matters: there’s only one MIT. And there’s only one Hollywood and only one Silicon Valley. This isn’t a coincidence: for whatever you’re doing, there’s usually only one place where the top people go. You should go there. Don’t settle for anywhere else. Meeting my heroes and learning from them gave me a huge advantage. Your heroes are part of your circle too — follow them. If the real action is happening somewhere else, move.

One thing I’ve learned is surrounding yourself with inspiring people is now just as important as being talented or working hard. Can you imagine if Michael Jordan hadn’t been in the NBA, if his circle of 5 had been a bunch of guys in Italy? Your circle pushes you to be better, just as Adam pushed me.

I was thrilled for him, but it was a shock for me. Here was my faithful beer pong partner and my little brother in the fraternity, two years younger than me. I was out of excuses. He was off to the Super Bowl and I wasn’t even getting drafted. He had no idea at the time, but Adam had given me just the kick I needed. It was time for a change.

How a crawler can mimic a browser as closely as possible

http headers

 
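When sending an HTTP request, always include these seven headers: Host, Connection, Accept, User-Agent, Referer, Accept-Encoding, and Accept-Language, because a normal browser sends all seven.

In detail:

  1. Host: most libraries already fill this in
  2. Connection: set to Keep-Alive
  3. Accept: usually text/html or application/json
  4. User-Agent: use your own crawler's UA or spoof a browser's UA
  5. Referer: the current URL is usually fine; consider setting it to follow the real visiting order, with Google as the initial referer
  6. Accept-Encoding: choose from gzip and deflate; many sites force a gzip response anyway
  7. Accept-Language: choose as appropriate, for example zh-CN, en-US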
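The seven headers above could be assembled roughly as follows. This is a library-agnostic sketch: the function name browser_headers, its parameters, and the placeholder User-Agent string are all assumptions made for illustration. Most HTTP clients set Host themselves, and some override Connection, so check what your client actually sends.

```python
from urllib.parse import urlsplit


def browser_headers(url, previous_url=None,
                    user_agent="example-crawler/0.1",  # placeholder UA
                    accept="text/html"):
    """Return the seven headers a normal browser would send for url."""
    return {
        "Host": urlsplit(url).netloc,
        "Connection": "Keep-Alive",
        "Accept": accept,
        "User-Agent": user_agent,
        # Follow the real visiting order; fall back to Google for the first hit.
        "Referer": previous_url or "https://www.google.com/",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,en-US",
    }


headers = browser_headers("http://example.com/list?page=2",
                          previous_url="http://example.com/list?page=1")
for name, value in headers.items():
    print(f"{name}: {value}")
```

If you advertise gzip in Accept-Encoding, remember that the response body may need to be decompressed before parsing.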

cookies

Cookies need to be kept up to date.

others

 
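There may be trap links that are invisible to humans; do not visit them.

Adaptive crawl interval

If the site has already throttled your IP, do not keep retrying blindly; take a break. NetEase Cloud Music is fairly easy to handle: just sleep for a while. The sleep interval should really grow with each failure: sleep 10 seconds the first time, and if you are still throttled, sleep 20 seconds, and so on. The value this adaptive interval finally settles at is an empirical one.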
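A small sketch of the growing sleep interval described above. The class and function names are hypothetical, and two details are assumptions that the text does not specify: the delay resets after a successful fetch, and it is capped at max_delay. The fetch callable is assumed to return None when the request was throttled.

```python
import time


class AdaptiveDelay:
    """Sleep interval that grows by a fixed step after each throttled request."""

    def __init__(self, step=10.0, max_delay=300.0):
        self.step = step
        self.max_delay = max_delay  # assumed cap, not from the text
        self.failures = 0

    def current(self):
        return min(self.failures * self.step, self.max_delay)

    def record(self, throttled):
        # Assumption: one success resets the interval to zero.
        self.failures = self.failures + 1 if throttled else 0
        return self.current()


def fetch_with_backoff(fetch, delay, sleep=time.sleep, max_attempts=5):
    """Call fetch() until it returns something other than None."""
    for _ in range(max_attempts):
        result = fetch()
        if result is not None:
            delay.record(throttled=False)
            return result
        # Throttled: wait 10 s, then 20 s, then 30 s, and so on.
        sleep(delay.record(throttled=True))
    return None


# Usage with a fake fetcher that is throttled twice, recording the waits
# instead of actually sleeping.
waits = []
responses = iter([None, None, "ok"])
delay = AdaptiveDelay()
result = fetch_with_backoff(lambda: next(responses), delay, sleep=waits.append)
print(result, waits)
```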

ref

  1. http://www.cnblogs.com/jexus/p/5471665.html

IPython main features

Print all history to a file: %history -g -f filename

  • Use ? to get quick help; ?? provides additional detail.

  • Searching through modules and namespaces with * wildcards, both when using the ? system and via the %psearch command.

  • magic commands, % or %% prefixed commands

  • Alias facility for defining your own system aliases.

  • Lines starting with ! are passed directly to the system shell, and using !! or var = !cmd captures shell output into python variables for further use.

  • python variable prefixed with $ is expanded. A double $$ allows passing a literal $ to the shell (for access to shell and environment variables like PATH).

  • Filesystem navigation, via a magic %cd command, along with a persistent bookmark system (using %bookmark) for fast access to frequently visited directories.

  • A lightweight persistence framework via the %store command, which allows you to save arbitrary Python variables. These get restored when you run the %store -r command.

  • Automatic indentation and highlighting of code as you type (through the prompt_toolkit library).

  • %macro command. Macros can be stored persistently via %store and edited via %edit.

  • Session logging (you can then later use these logs as code in your programs). Logs can optionally timestamp all input, and also store session output (marked as comments, so the log remains valid Python source code).
    Session restoring: logs can be replayed to restore a previous session to the state where you left it.

  • Auto-parentheses via the %autocall command: callable objects can be executed without parentheses: sin 3 is automatically converted to sin(3)

  • Auto-quoting: using , or ; as the first character forces auto-quoting of the rest of the line: ,my_function a b becomes automatically my_function("a","b"), while ;my_function a b becomes my_function("a b").

  • Extensible input syntax. You can define filters that pre-process user input to simplify input in special situations. This allows for example pasting multi-line code fragments which start with >>> or ... such as those from other python sessions or the standard Python documentation.

  • Flexible configuration system. It uses a configuration file which allows permanent setting of all command-line options, module loading, code and file execution. The system allows recursive file inclusion, so you can have a base file with defaults and layers which load other customizations for particular projects.

  • Embeddable. You can call IPython as a python shell inside your own python programs. This can be used both for debugging code or for providing interactive abilities to your programs with knowledge about the local namespaces (very useful in debugging and data analysis situations).

  • Easy debugger access. You can set IPython to call up an enhanced version of the Python debugger (pdb) every time there is an uncaught exception. This drops you inside the code which triggered the exception with all the data live and it is possible to navigate the stack to rapidly isolate the source of a bug. The %run magic command (with the -d option) can run any script under pdb’s control, automatically setting initial breakpoints for you. This version of pdb has IPython-specific improvements, including tab-completion and traceback coloring support. For even easier debugger access, try %debug after seeing an exception.

  • Profiler support. You can run single statements (similar to profile.run()) or complete programs under the profiler’s control. While this is possible with standard cProfile or profile modules, IPython wraps this functionality with magic commands (see %prun and %run -p) convenient for rapid interactive work.

  • Simple timing information. You can use the %timeit command to get the execution time of a Python statement or expression. This machinery is intelligent enough to do more repetitions for commands that finish very quickly in order to get a better estimate of their running time.

In [1]: %timeit 1+1
10000000 loops, best of 3: 25.5 ns per loop

In [2]: %timeit [math.sin(x) for x in range(5000)]
1000 loops, best of 3: 719 µs per loop

To get the timing information for more than one expression, use the %%timeit cell magic command.
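For comparison, a rough sketch of similar behaviour outside IPython, using the standard library's timeit module. Timer.autorange() picks a loop count large enough for the total run to take at least about 0.2 seconds, which approximates the "more repetitions for fast commands" idea; the output format below is made up and is not IPython's.

```python
import math
import timeit

# The statement runs in a namespace where only `math` is predefined.
timer = timeit.Timer("[math.sin(x) for x in range(5000)]",
                     globals={"math": math})

# autorange() returns (number_of_loops, total_seconds).
loops, total = timer.autorange()
per_loop = total / loops
print(f"{loops} loops, {per_loop * 1e6:.1f} us per loop")
```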

  • Doctest support. The special %doctest_mode command toggles a mode to use doctest-compatible prompts, so you can use IPython sessions as doctest code. By default, IPython also allows you to paste existing doctests, and strips out the leading >>> and ... prompts in them.

Turing Award winner John Hopcroft's lecture at Toutiao

revolutions

  • agricultural revolution 10000 BC
  • industrial revolution 1700 AD
  • information revolution 2015 AD

jobs

  • there used to be elevator operators, but that job has disappeared; drivers will follow
  • what if only 25% of the workforce is needed to produce all the goods and services?
  • we are living in a changing world; jobs in the future will require a sophisticated education well beyond that available today

China’s education

deep learning

  • many-layered network: the first layers learn it's an image, the later layers learn the style and content but lose the image pixels
  • SVM was a big advance, deep networks are a big advance, but we don't understand deep networks

Closing thoughts

It is great to be rich: you can invite big names to give you lectures. I want to get rich too.

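小海星's Honor of Kings (王者荣耀) guide

Picking heroes

A good start is half the battle. If the team is already at odds during hero selection, the game is basically half lost.
A team generally needs a tank, a marksman, and a mage. This is not absolute, but if you were an expert you probably would not be reading a guide; in general none of the three can be missing. If you really want to play a particular hero, pick it first. Players who pick later are generally responsible for filling the gaps in the lineup, so never stall until the last second and end up with a broken composition. Of course, the way people play, mage and marksman usually get snatched up, and if you want to win at that point, please fill in as the tank. This really only goes wrong in low-rank games; once the rating is a bit higher everyone understands, and as long as people are playing to win rather than out of spite, they will switch to a hero that suits the lineup. I understand the urge to try new things, but beginners should avoid constantly switching to heroes they cannot play, which hurts both teammates and themselves. Try to get comfortable with one hero for each position, so that whichever position is missing, you have something to pick. Since you are seriously playing those few heroes, it is worth looking up how to use them. You need not be as earnest as if you were studying, but learning from other people's experience beats figuring it out alone, and if the opponent picks the same hero and plays it better than you, study their approach, combos, and so on.
Beyond those three positions, you can also pick an assassin or a support. Variety in roles is good, but many assassin heroes are hard for beginners to pick up; if it really is not working, pick another position you are good at, and teammates usually will not complain. High-rank games are different, but then there is no need to read a guide - -.
Summoner spells vary from person to person, but the key is still how well they fit the hero's own skills and role. There is no hard rule, but an attitude like "I just like Flash, so I take Flash on every hero" is allowed and yet is the wrong mindset. Some supports might take Heal (治疗术), some tanks and low-damage warriors can take Execute (斩杀), assassins with weak mobility skills can take Haste (急速) or Flash (闪现), and so on.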

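Lane assignment at the start

The usual approach: dive-type tanks, warriors, and tankier assassins go top lane alone; an assassin or warrior jungles; the mage goes mid; the marksman and support go bottom. The support need not be a hero classed as support; tanks and mages that are weak early can also go bottom with the marksman.
Of course lineups come in all shapes, and the key is to assign lanes along these lines. The marksman is always very fragile, and it needs very high damage late in the game or it is useless, so it usually pushes its lane together with a teammate. Some heroes do not need to come online quickly early on, or only need to land control skills from start to finish; such heroes usually support the marksman at the start. Before the marksman, assassin, and tank have their items, the mage casts quickly and deals decent damage, which makes it the key to the fight, so the mage usually goes mid, where it can drive the whole map in the mid game. The brush on both sides of mid is also dangerous and calls for a hero with strong solo ability. Top lane often runs into the enemy's marksman-plus-support pair, and it is not easy to gain an advantage there, so a high-health hero takes the pressure. The jungle is also an important resource. Jungling is relatively safe, and the assassin's late-game damage is badly needed: if it has no damage in the mid game and cannot pick up a few kills, it cannot kill anyone late and becomes useless, so the assassin usually jungles rather than absorbing lane pressure. Jungling at level 1 is a bit hard, so the jungler usually asks teammates to help with the red buff; generally, do not compete with the jungler for buffs. As for jungling itself, one person rotating between the two jungle halves should be workable. Try not to have two junglers: one half each is not enough to go around, and your marksman would then have to absorb pressure alone, which is not worth it. If teammates are free, they can help take the small dragon (the Tyrant, 暴君). Fighting the dragon early hurts quite a bit, so run when you need to. Taking the dragon usually depends on the minion waves: if teammates in both lanes next to the dragon are behind and pinned under their towers, taking the dragon is very dangerous, so avoid it.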

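What is a warrior

Tankier than an assassin, with higher damage output than a tank.

How to play a tank

A tank generally has two jobs. One is to flank without being seen: as soon as teammates engage and draw attention, jump on the enemy marksman and any easy-to-kill mage, so that the enemy's high-damage marksman and mage can never find a position to deal damage. This splits the battlefield: if you hold off two, your teammates are temporarily fighting 4 against 3. When you are nearly out of health, though, remember to keep an escape skill in reserve.
The other is when an enemy assassin has already gone for our marksman: use control and slow skills to hold the assassin down so that our marksman can finish it off quickly.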

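Where the name "tank" comes from

It started with the warrior/priest/mage system in online games: three main roles of meat shield, healer, and damage dealer. In World of Warcraft, at any rate, the role has always been called the tank, and players abroad call it that too. Later the games simply used the word themselves.
In MOBA games, these three roles are subdivided into many positions.
Tank, warrior, assassin: a spectrum from tankiness to damage. Zhao Yun (赵云) is a warrior/assassin, at least according to the in-game description - -.
Damage dealers split into physical and magic routes, which in turn forces tanks to choose how to build their items. The ranged ones are marksmen and mages; the melee ones are assassins.
In MOBA games the role is generally called support, and healing is only a small branch of it; after all, if everyone could be healed to full and never die, what would be the point of playing? Supports generally provide buffs, control, and healing. Support can be combined with tank, warrior, mage, and so on: Mi Yue (芈月) blends tank and mage, Cai Wenji (蔡文姬) does healing and control, and Zhuang Zhou (庄周) is also tank plus mage.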

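Big-picture awareness

I feel that many teammates play with no sense of the big picture. It really comes down to nothing more than mechanics, thinking, and experience.

  1. When you are ahead but cannot take the high-ground tower, and your team has enough damage, take the dragon to turn the advantage into an offensive.
  2. When ahead, play aggressively: take towers wherever possible, deny the enemy their minions and jungle, and suppress their economy to get closer to victory. Strategically, though, staying alive comes first. Dying to a misplay is one thing, but do not dive just because several enemies are low on health; always work out how you will get back before going in. When behind, avoid reckless fights as far as possible; look for chances to jungle and push lanes, narrow the gap, and wait for an opening.
  3. Being spotted while taking the dragon can go several ways; if your team is in poor shape, you must run. With the enemy watching closely, taking the dragon is hard, but stopping them from taking it should be fairly easy. If your side really does have the numbers advantage, the tank, assassin, and similar heroes are responsible for dealing with the enemies who come to harass.
  4. Apart from the outermost towers, defend towers whenever you can, especially the high-ground towers. When the enemy has the big dragon, try to stall until it expires. When your side has just killed the big dragon, try to wait until the dragon comes out before fighting; otherwise, if you lose the teamfight, the dragon is wasted. Ideally, protect the dragon at first, and once it reaches the enemy's half, tie the enemy up in a fight so the dragon can slowly grind down their towers.
  5. If you find at the start of laning that you cannot beat your opponent, farm cautiously and do not overextend, unless a teammate comes specifically to help with a kill. Using your own tower well to protect yourself, or even to turn the fight around, in effect lets the tower make up for your hero's disadvantage, while the enemy's tower can only stand and watch, its damage wasted.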