Month: August 2017

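How is an XPath generator implemented?

When you write crawlers, the one step that ultimately resists automation is specifying the XPath of the elements to extract. Crawling a specific site almost always comes down to finding the XPaths for the next-page link and the data elements. If XPath generation could be handed to operations colleagues who cannot program, it would free up a great deal of programmer time.

After all, XPath is a DSL, and it is still fairly hard for people who do not program. Plenty of PMs write SQL fluently, but an operations colleague who can write XPath is hard to find. Every profession has its own specialty, and the problems operations staff deal with are quite different from the ones we programmers face. After many years of experience, I feel that teaching them YAML is already the limit…

So can a graphical tool generate XPath? Of course: Chrome ships with a built-in XPath generator, as shown below: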

![chrome xpath](https://ws4.sinaimg.cn/large/006tNc79ly1flhwpu64uvj31f20keaj4.jpg)

The XPath generated for this screenshot is: `//*[@id="fc_B_pic"]/ul[1]/li[1]/a[1]`

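However, Chrome's XPath generation has several drawbacks:

1. Chrome only walks upward looking for an element with an id, whereas in practice reaching an ancestor with a class is often enough to guarantee that the XPath is correct.
2. Chrome tries to make the generated XPath match a single, unique element, so when you want an XPath that selects multiple elements, Chrome cannot help and you still have to rewrite it by hand.
3. Once generated, the XPath cannot be conveniently verified with a graphical tool.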
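A minimal sketch of a generator that avoids the first two drawbacks might look like the following. It is only an illustration, not how Chrome or any existing tool works: the function name `generate_xpath`, the `unique` flag, and the sample markup are all made up for this example, and it relies on the standard library's `xml.etree.ElementTree`, so it assumes well-formed markup (real-world HTML would need a proper HTML parser).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for this sketch.
DOC = """<html><body><div id="fc_B_pic"><ul class="list">
<li><a href="/1">one</a></li><li><a href="/2">two</a></li>
</ul></div></body></html>"""


def generate_xpath(root, target, unique=True):
    """Walk up from `target` and stop at the nearest node with an id or a class."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        if node.get("id"):
            steps.append(f"//*[@id='{node.get('id')}']")
            break
        if node.get("class"):
            steps.append(f"//*[@class='{node.get('class')}']")
            break
        parent = parents.get(node)
        if parent is None:  # reached the document root without an anchor
            steps.append("/" + node.tag)
            break
        if unique:
            # Positional index among siblings with the same tag (1-based).
            same_tag = [c for c in parent if c.tag == node.tag]
            steps.append(f"/{node.tag}[{same_tag.index(node) + 1}]")
        else:
            # No index: the path matches every sibling with this tag.
            steps.append(f"/{node.tag}")
        node = parent
    return "".join(reversed(steps))


root = ET.fromstring(DOC)
first_link = root.find(".//a")
strict = generate_xpath(root, first_link)               # selects one element
loose = generate_xpath(root, first_link, unique=False)  # selects all similar ones
# ElementTree needs a leading "." to evaluate these paths relative to root.
matches = root.findall("." + loose)
```

With `unique=False` the positional indexes are dropped, so the same path selects every link in the list, which is the multi-element case Chrome does not handle. Evaluating the path with `findall` is also a crude stand-in for the missing verification step.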

To be continued.

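Thoughts on Toutiao

# Reading Zhang Yiming's Weibo

> Imagine one of your reports handing in a resignation. If your reaction is "what a pity" or "I really care", you should think now about how to give that person more rewards and room to grow. Conversely, if your reaction is "fine", "even better", or even relief, you should consider a reassignment or a dismissal.

> Read old news from time to time and you will see how unreliable the media is.

> A great sword has no sharp edge; great skill needs no ornament.

> When crawling, analyzing, and extracting information across the whole web, the big problem is spam and its advanced forms, advertorials and ghostwritten puff pieces. PageRank is voting by web pages; SNS uses the behavior of identifiable people to vote. Are there good methods outside SNS platforms?

> Take the long view of things. The past cannot be undone, but the future can still be pursued.

He is probably, like Zhang Xiaolong, an excellent product manager with a technical background.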

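# Reflections

After Alibaba's mooncake incident, Yiming seems to have said: "We don't hand out mooncakes at Mid-Autumn Festival, because excellent engineers don't care about a box of mooncakes." That remark is actually quite foolish. An excellent employee might think: "How much does a box of mooncakes cost? If the company won't even spend that, will it be willing to invest in its employees in other ways?" Mooncakes are clearly an employee benefit with a very high ROI, so there is no need to say the opposite just to look different from Alibaba. In practice, of course, Toutiao's benefits are good: there are mooncakes, and there are gifts at Chinese New Year too.

Google's values statement is very short: Don't be evil. It is so short that everyone knows it. Baidu's is short too: simple and dependable. Although neither Google nor Baidu has lived up to its values very well, they at least work as publicity. Toutiao's values are too long, five or six separate words, and I still can't remember them. The average Toutiao employee probably stays less than two years, and you ask people to memorize such a long string of words; it's not as if you were the Core Socialist Values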

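Can crawled data be trusted?

If the data being analyzed is wrong to begin with, the conclusions drawn from it will be flawed too. Take the 2016 US presidential election: because Trump supporters were frequently insulted, respondents in telephone polls claimed to support Hillary Clinton, yet they actually voted for Trump. The phone polls were wrong from the start, so everyone believed Clinton would win. The Trump team instead asked voters who they thought their neighbors would vote for, and got the right answer.

Crawled data can have problems too: fake listings on rental sites, fake positions on job sites, users deliberately withholding real information to protect their privacy, inflated read counts on WeChat articles, and so on. A poorly written crawler may also wander into a honeypot, which makes the data even less reliable.

For example, suppose you analyze the housing market using crawled news. The Ministry of Housing and Urban-Rural Development actually prohibits publishing predictions of price increases, which distorts how market sentiment is expressed, so predictions based on this data would clearly be wrong.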

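Teach Yourself Programming in Ten Years

Get interested in programming and write programs because it is fun. Make sure it stays fun all the way through, so that you are willing to put ten years into it.

Talk with other programmers; read other people's code. This is more important than any book or training course.

Remember that there is a "computer" in "computer science". Know how long it takes your computer to execute an instruction, to fetch a word from memory (with and without a cache hit), to read consecutive words from disk, and to seek to a new location on disk.

Fred Brooks (author of The Mythical Man-Month), in his essay No Silver Bullet, describes a three-part plan for finding great software designers:

1. Systematically identify the best designers as early as possible.
2. Assign a career mentor to each promising person and plan their career carefully.
3. Give growing programmers opportunities to influence and stimulate one another.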

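左耳朵耗子 on performance evaluation

The purpose of setting goals and performance targets is not to grade people, but to improve the results and efficiency of the organization and its people.

People are complex and their state fluctuates. You should never dismiss a person lightly; performance evaluation should assess the work, not the person.

The biggest problem with evaluating values is that it is very easy to blow things out of proportion, very easy to turn into political infighting, and very easy to stifle differing ideas. Frankly, it is to a large extent a brainwashing technique: controlling thought by creating tension or fear.

KPIs suit industries that treat people as machines, while OKRs suit innovative industries where everyone is a member of the company.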

YF:
The yardstick used for evaluation must be long, in order to

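In a company, many new ideas come up every day, and not all of them get implemented: some people have no time to act on theirs, some ideas are pure flights of fancy, some cannot be realized because of other unknown problems, and sometimes the proposer only had a rough idea while the implementer turned it into something great.

If something is important or must be done, put it in writing and schedule it; only then will it actually get done. Of course, once you are in a management role, avoid writing everything down, because a large share of it is the take-it-or-leave-it work mentioned above. Give everyone freedom so that they can play to their strengths.

So use OKRs, not KPIs.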

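Reading Instagram's Python upgrade article

As Instagram's user base grew rapidly, performance problems emerged anyway: the growth rate of the server fleet gradually overtook the growth rate of users.

So they decided to skip the clumsy async IO implementations of Python 2 (poor gevent, tornado, twisted, and the rest) and upgrade straight to Python 3, to explore what the standard library's asyncio module could offer.

At Instagram, the migration to Python 3 had to satisfy two preconditions:

* No downtime: no service may become unavailable because of the migration
* No impact on the development of new product features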

Dropbox CEO’s speech

Bill Gates’s first company made software for traffic lights. Steve Jobs’s first company made plastic whistles that let you make free phone calls. Both failed, but it’s hard to imagine they were too upset about it. That’s my favourite thing that changes today. You no longer carry around a number indicating the sum of all your mistakes. From now on, failure doesn’t matter: you only have to be right once.

So that’s how 30,000 ended up on the cheat sheet. That night, I realised there are no warmups, no practice rounds, no reset buttons. Every day we’re writing a few more words of a story. And when you die, it’s not like “here lies Drew, he came in 174th place.” So from then on, I stopped trying to make my life perfect, and instead tried to make it interesting. I wanted my story to be an adventure — and that’s made all the difference.

And today on your commencement, your first day of life in the real world, that’s what I wish for you. Instead of trying to make your life perfect, give yourself the freedom to make it an adventure, and go ever upward. Thank you.

It took me a while to get it, but the hardest-working people don’t work hard because they’re disciplined. They work hard because working on an exciting problem is fun. So after today, it’s not about pushing yourself; it’s about finding your tennis ball, the thing that pulls you. It might take a while, but until you find it, keep listening for that little voice.

Fortunately, it doesn’t matter. No one has a 5.0 in real life. In fact, when you finish school, the whole notion of a GPA just goes away. When you’re in school, every little mistake is a permanent crack in your windshield. But in the real world, if you’re not swerving around and hitting the guard rails every now and then, you’re not going fast enough. Your biggest risk isn’t failing, it’s getting too comfortable.

Honestly, I don’t think I’ve ever been “ready.” I remember the day our first investors said yes and asked us where to send the money. For a 24 year old, this is Christmas — and opening your present is hitting refresh over and over on bankofamerica.com and watching your company’s checking account go from 60 dollars to 1.2 million dollars. At first I was ecstatic — that number has two commas in it! I took a screenshot — but then I was sick to my stomach. Someday these guys are going to want this back. What the hell have I gotten myself into?

They say that you’re the average of the 5 people you spend the most time with. Think about that for a minute: who would be in your circle of 5? I have some good news: MIT is one of the best places in the world to start building that circle. If I hadn’t come here, I wouldn’t have met Adam, I wouldn’t have met my amazing cofounder, Arash, and there would be no Dropbox.

And now your circle will grow to include your coworkers and everyone around you. Where you live matters: there’s only one MIT. And there’s only one Hollywood and only one Silicon Valley. This isn’t a coincidence: for whatever you’re doing, there’s usually only one place where the top people go. You should go there. Don’t settle for anywhere else. Meeting my heroes and learning from them gave me a huge advantage. Your heroes are part of your circle too — follow them. If the real action is happening somewhere else, move.

One thing I’ve learned is surrounding yourself with inspiring people is now just as important as being talented or working hard. Can you imagine if Michael Jordan hadn’t been in the NBA, if his circle of 5 had been a bunch of guys in Italy? Your circle pushes you to be better, just as Adam pushed me.

I was thrilled for him, but it was a shock for me. Here was my faithful beer pong partner and my little brother in the fraternity, two years younger than me. I was out of excuses. He was off to the Super Bowl and I wasn’t even getting drafted. He had no idea at the time, but Adam had given me just the kick I needed. It was time for a change.

How to make a crawler resemble a browser as closely as possible

# http headers
 
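When sending an HTTP request, the seven headers Host, Connection, Accept, User-Agent, Referer, Accept-Encoding, and Accept-Language must be set, because every normal browser sends all seven.

Specifically:

1. Host: usually filled in by the HTTP library already
2. Connection: set to Keep-Alive
3. Accept: usually text/html or application/json
4. User-Agent: use your own crawler's UA or spoof a browser's UA
5. Referer: the current URL is usually fine; consider setting it to follow the real navigation order, and Google can serve as the initial referer
6. Accept-Encoding: pick from gzip and deflate; many sites return gzip regardless
7. Accept-Language: choose as appropriate, for example zh-CN or en-US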
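As a rough illustration of the list above, the seven headers can be assembled into a plain dictionary and passed to whatever HTTP client you use. The function name `build_browser_like_headers`, the example URLs, and the User-Agent string are all hypothetical, and the concrete values are only one plausible choice.

```python
from urllib.parse import urlsplit


def build_browser_like_headers(
    url,
    referer="https://www.google.com/",  # initial referer, as suggested above
    user_agent="ExampleCrawler/0.1",     # hypothetical UA; swap in your own
    language="zh-CN,en-US;q=0.8",
):
    """Return the seven headers a normal browser would send for `url`."""
    return {
        "Host": urlsplit(url).netloc,  # most HTTP libraries set this themselves
        "Connection": "keep-alive",
        "Accept": "text/html",
        "User-Agent": user_agent,
        "Referer": referer,
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": language,
    }


# Follow the real navigation order: page 2 is reached from page 1.
headers = build_browser_like_headers(
    "http://www.example.com/list?page=2",
    referer="http://www.example.com/list?page=1",
)
```

Note that once `Accept-Encoding` advertises gzip, the client has to be able to decompress the response body.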

# cookies

Cookies need to be kept up to date.
 
# others
 
There may be trap links that are invisible to humans; do not follow them.
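One simple heuristic for avoiding such traps is to skip anchors that are hidden by their own markup. The sketch below uses only the standard library's `html.parser`; the class name `VisibleLinkCollector` and the sample markup are invented for illustration. It is deliberately naive: it only inspects the `<a>` tag's own `hidden` attribute and inline `style`, so links hidden through an ancestor element or an external stylesheet would still get through.

```python
from html.parser import HTMLParser

# Inline-style fragments treated as "invisible" (an assumption of this sketch).
HIDDEN_MARKERS = ("display:none", "visibility:hidden")


class VisibleLinkCollector(HTMLParser):
    """Collect hrefs from <a> tags that are not obviously hidden."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # Normalize "display: none" and "DISPLAY:NONE" to one comparable form.
        style = (attr.get("style") or "").replace(" ", "").lower()
        hidden = "hidden" in attr or any(m in style for m in HIDDEN_MARKERS)
        href = attr.get("href")
        if href and not hidden:
            self.links.append(href)


collector = VisibleLinkCollector()
collector.feed(
    '<a href="/news">news</a>'
    '<a href="/trap1" style="display: none">x</a>'
    '<a href="/trap2" hidden>y</a>'
)
visible_links = collector.links
```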

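# Adaptive crawl interval

If the site has already throttled your IP, do not keep blindly retrying; take a break first. NetEase Cloud Music is fairly easy to handle: just sleep for a while. The sleep interval should really grow as needed: sleep 10 seconds the first time, and if you are still throttled, sleep 20 seconds, and so on. Both the interval settings and how well the adaptation ends up working come down to empirical values.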
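The growing sleep interval described above can be sketched as follows. The names `crawl_with_backoff` and `fetch` are hypothetical, and the sketch assumes that `fetch(url)` returns `None` when the request was throttled; how throttling is actually detected depends on the site.

```python
import time


def crawl_with_backoff(fetch, url, base_delay=10, max_retries=5, sleep=time.sleep):
    """Retry `fetch(url)`, sleeping 10s, 20s, 30s, ... between throttled attempts."""
    delay = base_delay
    for _ in range(max_retries):
        response = fetch(url)
        if response is not None:  # assumption: None means "throttled"
            return response
        sleep(delay)
        delay += base_delay  # additive growth: 10 -> 20 -> 30 ...
    return None  # still throttled after max_retries attempts


# Usage with a fake fetcher that is throttled twice before succeeding.
# A recording stub replaces time.sleep so the example finishes instantly.
attempts = []
slept = []


def fake_fetch(url):
    attempts.append(url)
    return "ok" if len(attempts) > 2 else None


result = crawl_with_backoff(fake_fetch, "http://www.example.com/", sleep=slept.append)
```

Passing `sleep` as a parameter keeps the function easy to test; in real use the default `time.sleep` applies, and `base_delay` is the empirical value to tune.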

ref

1. http://www.cnblogs.com/jexus/p/5471665.html

IPython main features

* Use `?` to get quick help; `??` provides additional detail.

* Searching through modules and namespaces with * wildcards, both when using the ? system and via the %psearch command.

* magic commands, % or %% prefixed commands

* Alias facility for defining your own system aliases.

* Lines starting with ! are passed directly to the system shell, and using !! or var = !cmd captures shell output into python variables for further use.

* python variable prefixed with $ is expanded. A double $$ allows passing a literal $ to the shell (for access to shell and environment variables like PATH).

* Filesystem navigation, via a magic %cd command, along with a persistent bookmark system (using %bookmark) for fast access to frequently visited directories.

* A lightweight persistence framework via the %store command, which allows you to save arbitrary Python variables. These get restored when you run the %store -r command.

* Automatic indentation and highlighting of code as you type (through the prompt_toolkit library).

* %macro command. Macros can be stored persistently via %store and edited via %edit.
* Session logging (you can then later use these logs as code in your programs). Logs can optionally timestamp all input, and also store session output (marked as comments, so the log remains valid Python source code).
* Session restoring: logs can be replayed to restore a previous session to the state where you left it.

* Auto-parentheses via the %autocall command: callable objects can be executed without parentheses: `sin 3` is automatically converted to `sin(3)`.
* Auto-quoting: using `,` or `;` as the first character forces auto-quoting of the rest of the line: `,my_function a b` automatically becomes `my_function("a","b")`, while `;my_function a b` becomes `my_function("a b")`.

* Extensible input syntax. You can define filters that pre-process user input to simplify input in special situations. This allows for example pasting multi-line code fragments which start with `>>>` or `...` such as those from other python sessions or the standard Python documentation.

* Flexible configuration system. It uses a configuration file which allows permanent setting of all command-line options, module loading, code and file execution. The system allows recursive file inclusion, so you can have a base file with defaults and layers which load other customizations for particular projects.

* Embeddable. You can call IPython as a python shell inside your own python programs. This can be used both for debugging code or for providing interactive abilities to your programs with knowledge about the local namespaces (very useful in debugging and data analysis situations).

* Easy debugger access. You can set IPython to call up an enhanced version of the Python debugger (pdb) every time there is an uncaught exception. This drops you inside the code which triggered the exception with all the data live and it is possible to navigate the stack to rapidly isolate the source of a bug. The %run magic command (with the -d option) can run any script under pdb’s control, automatically setting initial breakpoints for you. This version of pdb has IPython-specific improvements, including tab-completion and traceback coloring support. For even easier debugger access, try %debug after seeing an exception.

* Profiler support. You can run single statements (similar to profile.run()) or complete programs under the profiler’s control. While this is possible with standard cProfile or profile modules, IPython wraps this functionality with magic commands (see %prun and %run -p) convenient for rapid interactive work.

* Simple timing information. You can use the %timeit command to get the execution time of a Python statement or expression. This machinery is intelligent enough to do more repetitions for commands that finish very quickly in order to get a better estimate of their running time.
```
In [1]: import math

In [2]: %timeit 1+1
10000000 loops, best of 3: 25.5 ns per loop

In [3]: %timeit [math.sin(x) for x in range(5000)]
1000 loops, best of 3: 719 µs per loop
```
To get the timing information for more than one expression, use the %%timeit cell magic command.
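Outside IPython, a rough equivalent of these measurements can be put together with the standard library's `timeit` module. This is only a sketch: the helper name `best_time` is made up, and unlike `%timeit` it uses a fixed loop count instead of choosing one automatically.

```python
import math
import timeit


def best_time(stmt, repeat=3, number=1000, env=None):
    """Return the best per-loop time in seconds over `repeat` runs."""
    timer = timeit.Timer(stmt, globals=env)
    # Each run executes `stmt` `number` times; keep the fastest run.
    return min(timer.repeat(repeat=repeat, number=number)) / number


fast = best_time("1 + 1")
slow = best_time(
    "[math.sin(x) for x in range(5000)]",
    number=100,
    env={"math": math},  # names the statement needs must be passed in
)
print(f"1 + 1:              {fast * 1e9:.1f} ns per loop")
print(f"5000 math.sin calls: {slow * 1e6:.1f} µs per loop")
```

Taking the minimum over several runs mirrors the "best of 3" wording in the IPython output above.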

* Doctest support. The special %doctest_mode command toggles a mode to use doctest-compatible prompts, so you can use IPython sessions as doctest code. By default, IPython also allows you to paste existing doctests, and strips out the leading `>>>` and `...` prompts in them.

Lecture by Turing Award winner John Hopcroft

# revolutions

agricultural revolution 10000 BC

industrial revolution 1700 AD

information revolution 2015 AD

# jobs

there used to be elevator operators, but that job has disappeared; drivers' jobs will disappear too

what if just 25% of the workforce is needed to produce all the goods and services?

we are living in a changing world; jobs in the future will require a sophisticated education well beyond what is available today

# China’s education

# deep learning
many-layered networks: the first layers learn that it's an image, the later layers learn the style and content but lose the image pixels

SVM is a big advance, deep networks are a big advance, but we don't understand deep networks