Month: June 2017

docker basics

Docker is a container for a process, **not a virtual machine**. It isolates the filesystem, network, and environment variables for a process. Ideally a container runs one and only one process, rather than multiple tasks.

Docker is best suited to stateless services, which are easy to scale horizontally. For stateful services, it is recommended to mount the state out of the container.

# Use cases

1. Create isolated environments for applications with different requirements, e.g. deploying two scripts where one needs Python 2.7 and the other needs Python 3.6.
2. Micro services. Micro services are easy to scale up. In this model, run only one process per container, and use orchestration tools such as compose, kubernetes, or swarm.
3. Daemon process manager. Docker is very simple to use as a daemon process manager; starting and listing daemon processes has never been this simple.
4. A jail for apps. Docker is good for jailing your application, preventing it from hurting your system, especially when you run code from other people (e.g. uploaded by a client).

Docker is so-called kernel containerization, in contrast to user-space containerization such as rkt. Docker stores images in a central store on your machine.

# Image vs Container

A container is a running instance of an image; each time you run an image, a new container is created. You can commit a container back into an image, though doing so is a little controversial.

Image name format: user/image:tag

# basic usage

* `docker run OPTIONS IMAGE COMMAND` to create a container from the given image and start it
    * the most used options are `-d` and `-it`
    * `--restart=always` to always restart the container
    * `--name=NAME` to name the container
* `docker start CONTAINER_ID` to restart a stopped container; note that this reuses the options and command given when `docker run` was issued
    * then use `docker attach CONTAINER_ID` to reattach to the given container
* `docker exec OPTIONS CONTAINER COMMAND` to run an extra command in a container

Note, docker is all about stdio: if you would like to read something, read it from stdin; if you would like to output something, write to stdout.

# building docker images
two ways:
* commit each change
* using dockerfiles

# Commands

## Container related

### run

Every `docker run` creates a brand-new container from the image; you can use `docker start` or `docker attach` to connect to a container that already exists. The relationship between an image and a container is roughly that between a program and a process.

Syntax:

`docker run [options] [image name] [command]`

* `docker exec -it [container] bash` can be used as an SSH equivalent
* `-d` detaches the container and runs it in the background
* `-p` maps ports (`host:container`)
* `--name` sets the name
* `--rm` cleans up the container after it exits
* `--net` sets the network to connect to
* `-w` sets the working dir
* `-e` sets an env variable
* `-u` sets the user
* `-v` mounts a volume (`host_file:container_file:options`)
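As a rough illustration of how these flags combine, here is a small, hypothetical Python helper (not part of docker itself) that assembles a `docker run` command line; the option names mirror the list above.

```python
def docker_run_cmd(image, command=None, detach=False, ports=None,
                   name=None, rm=False, volumes=None, env=None):
    """Assemble a `docker run` invocation as a list of arguments."""
    cmd = ["docker", "run"]
    if detach:
        cmd.append("-d")
    if rm:
        cmd.append("--rm")
    if name:
        cmd += ["--name", name]
    for host, container in (ports or {}).items():
        cmd += ["-p", f"{host}:{container}"]
    for host_path, container_path in (volumes or {}).items():
        cmd += ["-v", f"{host_path}:{container_path}"]
    for key, value in (env or {}).items():
        cmd += ["-e", f"{key}={value}"]
    cmd.append(image)
    if command:
        cmd += command
    return cmd

# e.g. pass the result to subprocess.run(...) to actually start a container
print(" ".join(docker_run_cmd("redis", detach=True, name="cache",
                              ports={6379: 6379})))
# → docker run -d --name cache -p 6379:6379 redis
```

The point is only that every option is additive and order-insensitive except for the trailing image and command.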

### status

`docker ps` shows running containers; `docker ps -a` shows all containers, including stopped ones

## Image related

* `docker pull`
* `docker images`
* `docker search`
* `docker build`: `docker build -t user/image [dir]`

## Network related

Basic commands:

```
docker network ls                                      list the network interfaces
docker network inspect                                 inspect a network for details
docker network create/rm                               create/remove a network
docker network connect/disconnect [net] [container]    connect/disconnect a container to/from a network
```

By setting a network, docker automatically creates an /etc/hosts file inside each container, and you can use the names of the containers to access the others.

docker has two network modes:

### Bridge mode

Use `docker run --net="bridge"`. This mode puts a layer of NAT forwarding behind the virtual interface docker0, so it is relatively inefficient. The upside is that you don't have to change code that binds a fixed port: docker assigns a random port on the host, avoiding conflicts.

### Host mode

Use `docker run --net="host"`. The host and the container share the same network, e.g. eth0.

## Volumes

Docker containers are generally stateless. Besides saving state to a database, you can use volumes to persist a container's state outside of it.

```
docker volume create --name hello
docker run -d -v hello:/container/path/for/volume container_image my_command
```

## Logs

You can use `docker logs [container]` to view stdout logs. But logs sent to /var/log/*.log stay inside the container by default.

Remove stopped containers:

```
docker rm $(docker ps -aq)
```

Using docker without sudo:

```
sudo gpasswd -a ${USER} docker
```

Then log out and log back in with the current user.

# References

https://blog.talpor.com/2015/01/docker-beginners-tutorial/

Building a proxy cluster for web scraping

If a scraper always uses the same IP or the same group of IPs, it is easily banned: in mild cases a captcha pops up, in severe cases access is blocked entirely.

This post discusses how to build a pool of proxy IPs, so that the proxy IP can be rotated frequently.

By the source of the proxy IPs, there are several approaches:

1. Scrape free-proxy websites
2. Exploit the fact that an ADSL redial gets a new IP, and build a cluster of ADSL machines
3. Use a cloud provider's API to switch IPs automatically

# Building your own ADSL cluster

## Finding a provider

Just finding a reliable provider is hard. These ADSL providers are generally not very technical and usually only offer CentOS images; one with CentOS 7.1 already counts as decent. One of them actually offered Ubuntu 14.04, which still had all kinds of problems and cost me about half a day.

Dockerfile basics

A Dockerfile lists reproducible steps for building a docker image. Compared to crafting an image step by step with docker commit, a Dockerfile is better suited to CI, automated testing, and similar systems.

Dockerfile instructions

* FROM: the base image
* MAINTAINER: the author; the suggested format is `Jon Snow <jonsnow@westros.com>`
* EXPOSE: ports to expose, though -p is usually still used to map ports
* USER: the user to run as
* WORKDIR: the working directory of the process
* COPY: copy files into the image
* RUN: run a shell command
* CMD: the command used to start the process
* ENTRYPOINT: the entry point of the image, `/bin/sh -c` by default
* ENV: set environment variables
* VOLUME: declare a volume

A few that are easy to confuse

COPY vs ADD

ADD automatically extracts archives; when you don't need that special behavior, prefer COPY.

ENTRYPOINT vs CMD

ENTRYPOINT specifies the binary that the docker image runs (possibly with arguments), while CMD specifies the arguments passed to that binary. But because the default entrypoint is `/bin/sh -c`, in practice what CMD specifies is also the command to run.

Also, if `docker run` is given command-line arguments, those are executed instead of the CMD contents. For example, passing `/bin/bash` as the command replaces CMD and drops you into the image, so you can look around and see what the built image actually contains.

Personally I tend to use only CMD and not ENTRYPOINT.
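To make the distinction concrete, here is a minimal, hypothetical Dockerfile sketch (the file names are made up): with the ENTRYPOINT line present, CMD only supplies default arguments; without it, CMD is the whole command.

```dockerfile
FROM python:3.6
WORKDIR /app
COPY . /app

# With ENTRYPOINT: `docker run image --port 8080` appends args after app.py
ENTRYPOINT ["python", "app.py"]
CMD ["--port", "8000"]

# Without ENTRYPOINT, CMD alone would be the full command:
# CMD ["python", "app.py", "--port", "8000"]
```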

How to understand the VOLUME instruction

VOLUME in a Dockerfile declares an anonymous docker volume: at `docker run` time, docker mounts the corresponding directory onto an anonymous volume. If the -v option specifies a mount target or a volume name, the anonymous volume is not used.

Dockerfile or commit?

Use a Dockerfile whenever possible, because it is reproducible.

I’ve been wondering the same thing, and my impression (which could be totally wrong) is that it’s really the same case as with VMs –> you don’t want to not know how to recreate the vm image. In my case I have regular .sh scripts to install, and am wondering why I can’t just maintain these, run docker and effectively call these, and create the golden version image that way. My scripts work to get it installed on a local PC, and the reason I want to use docker is to deal with conflicts of multiple instances of programs/have clean file system/etc
https://stackoverflow.com/questions/26110828/should-i-use-dockerfiles-or-image-commits

References

  1. https://stackoverflow.com/a/34245657/1061155
  2. https://stackoverflow.com/questions/41935435/understanding-volume-instruction-in-dockerfile

systemd

YN: How do I make an installed service start at boot? Is it done by changing WantedBy? If so, what should the value of WantedBy be? And how do I manage a daemon service like nginx?

Most Linux distributions have adopted systemd as the process manager. I originally planned to deploy services with supervisord, but after some thought, why not just use systemd directly. This post is a brief introduction to systemd.

# Example

Let's start with an example. Say we have the following go program:

```
package main

import (
    "fmt"
    "net/http"
)

func handler(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(w, "Hi there!")
}

func main() {
    http.HandleFunc("/", handler)
    http.ListenAndServe(":8181", nil)
}
```

Compile it to /opt/listen/listen. First we add a user to run our service:

```
adduser -r -M -s /bin/false www-data
```

Remember this command; whenever you need a dedicated user just to run a service, this is the one to use.

Unit files

A unit file defines a systemd service. /usr/lib/systemd/system/ holds unit files installed by system packages, while /etc/systemd/system/ holds units defined by the administrator. We edit /etc/systemd/system/listen.service:

```
[Unit]
Description=Listen

[Service]
User=www-data
Group=www-data
Restart=on-failure
ExecStart=/opt/listen/listen
WorkingDirectory=/opt/listen

Environment=VAR1=whatever "VAR2=something else"
EnvironmentFile=/path/to/file/with/variables

[Install]
WantedBy=multi-user.target
```

Then:

```
sudo systemctl enable listen
sudo systemctl start listen
sudo systemctl status listen
```

Other common operations include:

```
systemctl start/stop/restart
systemctl reload/reload-or-restart
systemctl enable/disable
systemctl status
systemctl is-active
systemctl is-enabled
systemctl is-failed
systemctl list-units [--all] [--state=…]
systemctl list-unit-files
systemctl daemon-reload
systemctl cat [unit-name]
systemctl edit [unit-name]
systemctl list-dependencies [unit]
```

Dependencies

In that case add Requires=B and After=B to the [Unit] section of A. If the dependency is optional, add Wants=B and After=B instead. Note that Wants= and Requires= do not imply After=: if After= is not specified, the two units will be started in parallel. In short, if your service depends on another service, use Requires= + After=, or Wants= + After=.
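As a sketch, a hypothetical A.service that must start after B.service would carry these lines in its unit file:

```
[Unit]
Description=A
Requires=B.service
After=B.service

# for an optional dependency, use instead:
# Wants=B.service
# After=B.service
```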

Types

Type: simple / forking. For the meaning of each field, you can refer to this article.

Viewing logs with journalctl

First, a gripe: why pick such a mouthful of a word as journal, wouldn't logctl have been nicer...

```
journalctl -u service-name.service
```

You can also add -b to see only logs since the last boot.

Running multiple instances

https://unix.stackexchange.com/questions/288236/have-systemd-spawn-n-processes
http://0pointer.de/blog/projects/instances.html

django tips

# Running the development server

```
python manage.py runserver [host:]port
```

You can specify the IP to bind to.

## Creating a user and changing passwords

```
python manage.py createsuperuser # create a superuser
```

```
python manage.py changepassword username
```

## Opening a shell for the current project

In this python shell, you can use django models directly.

```
python manage.py shell
```

# timezone aware time

When saving a datetime field to the database, django often warns that timezone info is missing; use django's own timezone.now()

```
from django.utils import timezone
now_aware = timezone.now()
```

Thrift RPC framework

Thrift is a full-stack RPC framework; it includes an interface definition language (IDL) and an RPC runtime, roughly matching the functionality of protobuf + gRPC.

# Installation

You can install it with the scripts in https://github.com/yifeikong/install

# Thrift types and the IDL

The base types are `bool, byte/i8, i16, i32, i64, double, string, binary`.

- Annoyingly, thrift does not support uint; the reason is that many languages have no native unsigned types (sigh...)
- the binary type corresponds to bytes in some languages
- string is utf-8 encoded
- byte and i8 are the same type, and both are signed

## Composite types (struct)

A struct is like a struct or class in a programming language: it is used to define your own types. Note that when defining a type in Thrift, fields must be numbered; this allows more efficient serialization.

Note the required and optional keywords: required marks a mandatory field, while an optional field can be omitted. For compatibility, it is recommended to declare fields optional whenever possible.

```
struct Cat {
    1: required i32 number = 10;  // fields can have default values
    2: optional i64 big_number;
    3: double decimal;
    4: string name = "thrifty";   // strings can have default values too
}
```

## exceptions

Thrift can also define exceptions; the keyword is exception, and the rest of the syntax is the same as struct.

## typedef

Thrift supports C/C++-style typedefs:

```
typedef i32 MyInteger
typedef Tweet ReTweet
```

## Enums

```
enum Operation {
    ADD = 1;
    SUB = 2;
    MUL = 3;
    DIV = 4;
}
```

## Container types

Thrift includes the common container types `list`, `set`, and `map`.

- `list<t1>`: an ordered array of t1 elements
- `set<t1>`: an unordered set of t1 elements
- `map<t1, t2>`: a dictionary with keys of type t1 and values of type t2
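For instance, the containers can be nested into structs; a hypothetical `Shelter` struct combining them with the `Cat` struct defined earlier might look like:

```
struct Shelter {
    1: optional list<string> names;        // ordered
    2: optional set<i32> ids;              // unordered, no duplicates
    3: optional map<string, Cat> by_name;  // keyed by cat name
}
```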

## Constants

Constants are defined with const:

```
const i32 INT_CONST = 1234;
const map<string, string> MAP_CONST = {"hello": "world", "goodnight": "moon"}
```

## Comments

Thrift supports Python- and C++-style comments.

```
# This is a valid comment.

/*
 * This is a multi-line comment.
 * Just like in C.
 */

// C++/Java style single-line comments work just as well.
```

## Namespaces

For each thrift file, you have to declare a namespace per target language.

```
namespace py tutorial
namespace java tutorial
```

## include

```
include "other.thrift"
```

# Services

A service is like an interface: it is defined in Thrift, and then implemented in concrete code on top of the files Thrift generates.

Note the `oneway` keyword: it means the client will not wait for a response.

```
service StringCache {
    void set(1:i32 key, 2:string value),
    string get(1:i32 key) throws (1:KeyNotFound knf),
    oneway void delete(1:i32 key)
}
```
## Generated code

Thrift's overall network stack looks like this:

![](https://ws4.sinaimg.cn/large/006tKfTcgy1fslz611nmfj30y40igdj2.jpg)

The generated code sits in the blue layer. The Transport layer implements the transmission of binary data; we can choose TCP, HTTP, or other protocols to carry our data. The Protocol layer defines how Thrift's internal data structures are serialized to binary and parsed back, using encodings such as JSON or compact. The Processor is responsible for reading a request from a Protocol, calling user code, and writing the response. The Server can be implemented in many ways, e.g. multi-threaded or multi-process.

The Processor interface is defined as:

```
interface TProcessor {
    bool process(TProtocol in, TProtocol out) throws TException
}
```

Concretely, a Server:

- creates a Transport for moving data
- creates input and output Protocols for that Transport
- creates a Processor on top of those Protocols
- waits for client requests, hands each incoming request to the Processor, and loops forever

# Compiling

```
thrift -r --gen py file.thrift
```

The generated files are placed in the gen-py directory.

- `-r` compiles recursively
- `--gen` specifies the language to generate

# An example

The handler implements the service, and the Server uses the Handler.

A Python server and client.

# Common questions

YN: thread safety

1. thrift ships thread/process based server types by default; you need to consider the thread safety of the handler
2. the thrift client is not thread-safe; be careful when using it in multithreaded programs (http://grokbase.com/t/thrift/user/127yhv7wmx/is-the-thrift-client-thread-safe)
3. take a look at how pyutil uses it...

When do you need a thrift service, rather than wrapping a class or a DAL to do the work?

1. calls across languages or across codebases
2. a service that has to hold a heavyweight resource

Within a single language, if you just need to read and write some databases, wrapping a class is enough.

Where should constants be defined?

If a constant is used during the calls, put it in thrift; if it is stored in the database, use constants defined in code.

## Thrift vs http api

A few reasons other than speed:

1. Thrift generates the client and server code completely, including the data structures you are passing, so you don’t have to deal with anything other than writing the handlers and invoking the client. And everything, including parameters and returns, is automatically validated and parsed, so you are getting sanity checks on your data for free.
2. Thrift is more compact than HTTP, and can easily be extended to support things like encryption, compression, non-blocking IO, etc.
3. Thrift can be set up to use HTTP and JSON pretty easily if you want it (say if your client is somewhere on the internet and needs to pass firewalls)
4. Thrift supports persistent connections and avoids the continuous TCP and HTTP handshakes that HTTP incurs.

Personally, I use thrift for internal LAN RPC and HTTP when I need connections from outside.

# References

1. https://stackoverflow.com/questions/9732381/why-thrift-why-not-http-rpcjsongzip
2. https://thrift-tutorial.readthedocs.io/en/latest/usage-example.html#a-simple-example-to-warm-up
3. http://thrift-tutorial.readthedocs.io/en/latest/index.html
4. https://diwakergupta.github.io/thrift-missing-guide/
5. http://thrift.apache.org/tutorial/py

SSL Pinning and how to break it

What is SSL Pinning

To view https traffic, you could sign your own root CA and perform a mitm attack to view the traffic. HPKP (http public key pinning) stops this sniffing by trusting only a given CA, so your self-signed certs will be invalid. To make a given app trust your certs, you would have to modify the apk file.

How to break it?

Introducing Xposed

Decompiling, modifying, and then recompiling the apk file can be very difficult, so you'd better hook into some API to make the app you are trying to intercept trust your certs. Xposed offers this kind of ability. Moreover, an Xposed module called JustTrustMe has done the tedious work for you: just install Xposed and JustTrustMe and you are good to go. Here are the detailed steps:

1. Install Xposed Installer. For android 5.0 and above, use the xposed installer. NOTE: on MIUI, you need to search for the MIUI-specific build of the Xposed installer.

2. Install Xposed from the xposed installer; note that you have to grant root privileges to the xposed installer.

3. Install JustTrustMe.

uwsgi and the wsgi protocol

uWSGI is a web server that runs python web frameworks. uwsgi (lower case) is the protocol it uses to communicate with front-end web servers (e.g. nginx).

The wsgi protocol

YN:

It is worth noting that wsgi actually defines a synchronous model: every client request invokes a synchronous function, so asynchronous features cannot be exploited.

Two minimal examples

Implementing the simple_app function below is exactly implementing the wsgi protocol. Three things to note:

1. the variables contained in the environ dict
2. the arguments of start_response
3. the call order and return value of simple_app
```
HELLO_WORLD = b"Hello world!\n"

def simple_app(environ, start_response):
    """Simplest possible application object"""
    status = '200 OK'
    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)
    return [HELLO_WORLD]

class AppClass:
    """Produce the same output, but using a class
    (Note: 'AppClass' is the "application" here, so calling it
    returns an instance of 'AppClass', which is then the iterable
    return value of the "application callable" as required by
    the spec.
    If we wanted to use *instances* of 'AppClass' as application
    objects instead, we would have to implement a '__call__'
    method, which would be invoked to execute the application,
    and we would need to create an instance for use by the
    server or gateway.
    """
    def __init__(self, environ, start_response):
        self.environ = environ
        self.start = start_response
    def __iter__(self):
        status = '200 OK'
        response_headers = [('Content-type', 'text/plain')]
        self.start(status, response_headers)
        yield HELLO_WORLD
```
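To see the protocol in action without a real server, one can call simple_app directly with a stub environ and start_response. This is a minimal sketch: simple_app is restated so the snippet is self-contained, and fake_start_response is a hypothetical stand-in for the callable a real server would pass in.

```python
HELLO_WORLD = b"Hello world!\n"

def simple_app(environ, start_response):
    """Simplest possible application object (as above)."""
    status = '200 OK'
    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)
    return [HELLO_WORLD]

captured = {}

def fake_start_response(status, response_headers, exc_info=None):
    # a stand-in for the server's start_response: just record the values
    captured['status'] = status
    captured['headers'] = response_headers

# the "server" calls the application once and iterates over the result
body = b"".join(simple_app({}, fake_start_response))
print(captured['status'], body)  # → 200 OK b'Hello world!\n'
```

This makes point 3 above visible: start_response is called before the body is iterated, and the body is a list of bytes.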

For the server/gateway side, every incoming http client request triggers one call of this application callable:

```
import os, sys

enc, esc = sys.getfilesystemencoding(), 'surrogateescape'

def unicode_to_wsgi(u):
    # Convert an environment variable to a WSGI "bytes-as-unicode" string
    return u.encode(enc, esc).decode('iso-8859-1')

def wsgi_to_bytes(s):
    return s.encode('iso-8859-1')

def run_with_cgi(application):
    environ = {k: unicode_to_wsgi(v) for k, v in os.environ.items()}
    environ['wsgi.input']        = sys.stdin.buffer
    environ['wsgi.errors']       = sys.stderr
    environ['wsgi.version']      = (1, 0)
    environ['wsgi.multithread']  = False
    environ['wsgi.multiprocess'] = True
    environ['wsgi.run_once']     = True

    if environ.get('HTTPS', 'off') in ('on', '1'):
        environ['wsgi.url_scheme'] = 'https'
    else:
        environ['wsgi.url_scheme'] = 'http'

    headers_set = []
    headers_sent = []

    def write(data):
        out = sys.stdout.buffer

        if not headers_set:
            raise AssertionError("write() before start_response()")

        elif not headers_sent:
            # Before the first output, send the stored headers
            status, response_headers = headers_sent[:] = headers_set
            out.write(wsgi_to_bytes('Status: %s\r\n' % status))
            for header in response_headers:
                out.write(wsgi_to_bytes('%s: %s\r\n' % header))
            out.write(wsgi_to_bytes('\r\n'))

        out.write(data)
        out.flush()

    def start_response(status, response_headers, exc_info=None):
        if exc_info:
            try:
                if headers_sent:
                    # Re-raise original exception if headers sent
                    raise exc_info[1].with_traceback(exc_info[2])
            finally:
                exc_info = None     # avoid dangling circular ref
        elif headers_set:
            raise AssertionError("Headers already set!")

        headers_set[:] = [status, response_headers]

        # Note: error checking on the headers should happen here,
        # *after* the headers are set.  That way, if an error
        # occurs, start_response can only be re-called with
        # exc_info set.

        return write

    result = application(environ, start_response)
    try:
        for data in result:
            if data:    # don't send headers until body appears
                write(data)
        if not headers_sent:
            write('')   # send headers now if body was empty
    finally:
        if hasattr(result, 'close'):
            result.close()
```

References

  1. https://bottlepy.org/docs/dev/async.html
  2. http://uwsgi-docs-cn.readthedocs.io/zh_CN/latest/WSGIquickstart.html
  3. https://www.digitalocean.com/community/tutorials/how-to-deploy-python-wsgi-applications-using-uwsgi-web-server-with-nginx

squid proxy

# Install squid
plain old `apt-get update && apt-get install squid3 apache2-utils -y`

# Basic squid conf
Write `/etc/squid3/squid.conf` from scratch instead of using the super bloated default config file:

```
# note that on ubuntu 16.04, use squid instead of squid3
auth_param basic program /usr/lib/squid3/basic_ncsa_auth /etc/squid3/passwords
auth_param basic realm proxy
acl authenticated proxy_auth REQUIRED
http_access allow authenticated
forwarded_for delete
http_port 0.0.0.0:3128
```

Please note the `basic_ncsa_auth` program instead of the old `ncsa_auth`

# Setting up a user
`sudo htpasswd -c /etc/squid3/passwords username_you_like`, *on 16.04, it’s squid, not squid3*
and enter a password twice for the chosen username then
`sudo service squid3 restart`

see: https://stackoverflow.com/questions/3297196/how-to-set-up-a-squid-proxy-with-basic-username-and-password-authentication

# centos
I have to use centos, since adsl providers are not capable of providing ubuntu

check out this wonderful article: https://hostpresto.com/community/tutorials/how-to-install-and-configure-squid-proxy-on-centos-7/

```
yum install -y epel-release
yum install -y squid
yum install -y httpd-tools
```

```
systemctl start squid
systemctl enable squid
touch /etc/squid/passwd && chown squid /etc/squid/passwd
htpasswd -c /etc/squid/passwd root
```

edit `/etc/squid/squid.conf`

```
auth_param basic program /usr/lib64/squid/basic_ncsa_auth /etc/squid/passwd
auth_param basic children 5
auth_param basic realm Squid Basic Authentication
auth_param basic credentialsttl 2 hours
acl auth_users proxy_auth REQUIRED
http_access allow auth_users
http_port 3128
```

A small gotcha

By default, squid only proxies https traffic to port 443 and rejects CONNECT requests to other ports. You need to change the config file.

To fix this, add your port to this line in the config file:

```
acl SSL_ports port 443
```

so it becomes

```
acl SSL_ports port 443 4444
```

squid also denies CONNECT to all ports other than 443 by default; delete this line:

```
http_access deny CONNECT !SSL_Ports
```