proxy

Overseas crawler IP pool

https://github.com/constverum/ProxyBroker/blob/master/proxybroker/providers.py

https://list.proxylistplus.com/SSL-List-1

https://list.proxylistplus.com/Fresh-HTTP-Proxy-List-1

https://cool-proxy.net/
https://github.com/imWildCat/scylla/blob/master/scylla/providers/coolproxyprovider.py

https://free-proxy-list.net/
https://github.com/imWildCat/scylla/blob/master/scylla/providers/freeproxylist_provider.py

https://proxyhttp.net/
https://github.com/imWildCat/scylla/blob/master/scylla/providers/httpproxyprovider.py

https://www.ipaddress.com/proxy-list/
https://github.com/imWildCat/scylla/blob/master/scylla/providers/ipaddress_provider.py

http://proxy-list.org/english/index.php
https://github.com/imWildCat/scylla/blob/master/scylla/providers/proxylistprovider.py

https://raw.githubusercontent.com/sunny9577/proxy-scraper/master/proxies.json
https://github.com/imWildCat/scylla/blob/master/scylla/providers/proxyscraperprovider.py

http://www.proxylists.net/countries.html
https://github.com/imWildCat/scylla/blob/master/scylla/providers/proxylists_provider.py

https://github.com/imWildCat/scylla/blob/master/scylla/providers/proxynova_provider.py

http://pubproxy.com/api/proxy?limit=5&format=txt&type=http&level=anonymous&lastcheck=60&nocountry=CN

https://github.com/imWildCat/scylla/blob/master/scylla/providers/rmccurdy_provider.py

https://github.com/imWildCat/scylla/blob/master/scylla/providers/spysmeprovider.py

https://github.com/imWildCat/scylla/blob/master/scylla/providers/spysoneprovider.py

https://github.com/imWildCat/scylla/blob/master/scylla/providers/thespeedXprovider.py

https://proxy-daily.com/

http://ab57.ru/downloads/proxyold.txt

http://www.proxylists.net/http.txt

http://www.proxylists.net/http_highanon.txt

http://pubproxy.com/api/proxy?limit=5&format=txt&type=http&level=anonymous&last_check=60&country=CN

http://free-proxy.cz/zh/proxylist/country/CN/all/ping/all
https://github.com/phpgao/proxypool/blob/master/job/htmlcz.go

http://nntime.com/proxy-updated-01.htm
https://github.com/phpgao/proxypool/blob/master/job/htmlnntime.go

https://premproxy.com/list/time-01.htm
https://github.com/phpgao/proxypool/blob/master/job/htmlpremproxy.go

https://github.com/phpgao/proxypool/blob/master/job/htmlproxydb.go

https://github.com/phpgao/proxypool/blob/master/job/htmlsite_digger.go

https://github.com/phpgao/proxypool/blob/master/job/htmlultraproxies.go

https://github.com/phpgao/proxypool/blob/master/job/htmlus_proxy.go

https://github.com/phpgao/proxypool/blob/master/job/jsoncool_proxy.go

https://github.com/phpgao/proxypool/blob/master/job/realiveproxy.go

https://github.com/phpgao/proxypool/blob/master/job/reblackhat.go

https://github.com/phpgao/proxypool/blob/master/job/redogdev.go

https://github.com/phpgao/proxypool/blob/master/job/refreeip.go

https://github.com/phpgao/proxypool/blob/master/job/rehttptunnel.go

https://github.com/phpgao/proxypool/blob/master/job/remy_proxy.go

https://github.com/phpgao/proxypool/blob/master/job/renewproxy.go

https://github.com/phpgao/proxypool/blob/master/job/reproxyiplist.go

https://github.com/phpgao/proxypool/blob/master/job/reproxylist.go

https://github.com/phpgao/proxypool/blob/master/job/rexseo.go

https://github.com/derekhe/ProxyPool/blob/master/lib/proxybroker/providers.py

https://github.com/Jiramew/spoon/blob/master/spoonserver/proxy/listendeprovider.py

https://github.com/Jiramew/spoon/blob/master/spoonserver/proxy/nordprovider.py

https://github.com/Jiramew/spoon/blob/master/spoonserver/proxy/pdbprovider.py

https://github.com/Jiramew/spoon/blob/master/spoonserver/proxy/plpprovider.py

https://github.com/Jiramew/spoon/blob/master/spoonserver/proxy/premprovider.py

https://github.com/Jiramew/spoon/blob/master/spoonserver/proxy/sslprovider.py

https://github.com/Jiramew/spoon/blob/master/spoonserver/proxy/webprovider.py

https://www.freeproxy.world/

http://proxydb.net/

http://www.xsdaili.cn/

https://github.com/bluet/proxybroker2/blob/master/proxybroker/providers.py

https://github.com/nicksherron/proxi/blob/master/internal/providers.go

# def freeProxy10():
#     """
#     cn-proxy, a site outside the Great Firewall
#     :return:
#     """
#     urls = ['http://cn-proxy.com/', 'http://cn-proxy.com/archives/218']
#     request = WebRequest()
#     for url in urls:
#         r = request.get(url, timeout=10)
#         proxies = re.findall(r'<td>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})</td>[\w\W]<td>(\d+)</td>', r.text)
#         for proxy in proxies:
#             yield ':'.join(proxy)

# @staticmethod
# def freeProxy11():
#     """
#     https://proxy-list.org/english/index.php
#     :return:
#     """
#     urls = ['https://proxy-list.org/english/index.php?p=%s' % n for n in range(1, 10)]
#     request = WebRequest()
#     import base64
#     for url in urls:
#         r = request.get(url, timeout=10)
#         proxies = re.findall(r"Proxy\('(.*?)'\)", r.text)
#         for proxy in proxies:
#             yield base64.b64decode(proxy).decode()

# @staticmethod
# def freeProxy12():
#     urls = ['https://list.proxylistplus.com/Fresh-HTTP-Proxy-List-1']
#     request = WebRequest()
#     for url in urls:
#         r = request.get(url, timeout=10)
#         proxies = re.findall(r'<td>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})</td>[\s\S]*?<td>(\d+)</td>', r.text)
#         for proxy in proxies:
#             yield ':'.join(proxy)

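Crawler IP bans and how to get around them

A crawler that always uses the same IP, or the same fixed group of IPs, is easy to ban: at best it gets CAPTCHA challenges, at worst it loses access entirely.

This note mainly discusses how to build a proxy IP pool so that the crawler can rotate proxy IPs frequently.

By source of the proxy IPs, there are a few main approaches:

  1. Scrape them from free proxy sites
  2. Exploit the fact that redialing ADSL yields a new IP, and build a cluster of ADSL machines
  3. Use a cloud provider's API to switch IPs automatically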
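Whichever source is used, the crawler needs something that hands out proxies in turn and drops the ones that get banned. The following is a minimal sketch of such a pool, not a full implementation; the class name `ProxyPool` and its methods are made up for illustration, and the addresses are documentation placeholders.

```python
from collections import deque
from typing import Iterable


class ProxyPool:
    """Round-robin pool of 'ip:port' strings (hypothetical helper)."""

    def __init__(self, proxies: Iterable[str]) -> None:
        # dict.fromkeys drops duplicates while keeping insertion order
        self._queue = deque(dict.fromkeys(proxies))

    def __len__(self) -> int:
        return len(self._queue)

    def next(self) -> str:
        """Return the proxy at the head and move it to the tail."""
        if not self._queue:
            raise LookupError("proxy pool is empty")
        proxy = self._queue[0]
        self._queue.rotate(-1)
        return proxy

    def ban(self, proxy: str) -> None:
        """Remove a proxy that the target site has blocked."""
        try:
            self._queue.remove(proxy)
        except ValueError:
            pass  # already gone


pool = ProxyPool(["192.0.2.1:8080", "192.0.2.2:8080", "192.0.2.1:8080"])
first = pool.next()
second = pool.next()
pool.ban(second)
```

A real pool would also refill itself from one of the three sources above and re-check proxies periodically.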

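The core of anti-crawling is telling normal user traffic apart from malicious crawlers. The source IP is an important feature of a request, and quite a few anti-crawling strategies can be built on it:

  • Is it a proxy IP?
  • Is it a residential IP?
  • IP geolocation

Large-scale crawlers generally run on servers, and proxy clusters are built on servers too, whereas normal users' IP addresses fall within residential ranges. This makes things easy for the anti-crawling side: requests from data centers can be throttled or blocked outright, while residential IPs are treated more leniently.

Let's look at a few concrete cases.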

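Crawling a site directly

A normal user generates few page views. If one IP generates an unusually large number of requests, it is almost certainly a crawler, and the site can ban it outright or require a CAPTCHA on every request.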
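From the site's side, "unusually large" can be made concrete with a per-IP sliding window. The sketch below is one possible way to do it, not something the original setup used; `RateTracker`, the limit of 3 and the 60-second window are all illustrative assumptions.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class RateTracker:
    """Counts requests per IP inside a sliding time window (hypothetical)."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit      # max requests allowed inside the window
        self.window = window    # window length in seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def hit(self, ip: str, now: float) -> bool:
        """Record one request at time `now`; return True if over the limit."""
        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the window
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        return len(hits) > self.limit


tracker = RateTracker(limit=3, window=60)
flags = [tracker.hit("203.0.113.5", t) for t in (0, 1, 2, 3)]
```

Here the fourth request inside the same minute is the first one flagged; what to do with a flagged IP (ban or CAPTCHA) is a separate policy decision.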

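Once an IP is banned it is usually not unbanned, or only after a long time. That leaves two options: lower the request rate and change your behavior pattern to avoid the ban, or switch IPs. In practice, however you change your behavior, the request volume is hard to bring down, so the only real option is to switch to another IP and keep crawling.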

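Using proxy IPs from proxy-list sites

Some hackers use port scanners to find open proxies on the Internet and then offer them to other users for free or for a fee, for example on sites like these:

Free proxies

However, probably fewer than 10% of the proxies on these sites work out of the box, and they go stale quickly. To use them you first have to scrape the sites, and then use each proxy as soon as you fetch it.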
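The scrape-then-check loop can be sketched as below. The regular expression is the one from the commented-out `freeProxy12` function earlier in these notes; `parse_proxies`, `usable` and `tcp_check` are hypothetical helper names. Note that a successful TCP connect only shows the port is open, not that the host really forwards traffic, so this is a cheap first filter rather than a full validation.

```python
import re
import socket
from typing import Callable, Iterable, List

# Same pattern as freeProxy12: an IP cell followed (lazily) by a port cell
PROXY_ROW = re.compile(
    r"<td>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})</td>[\s\S]*?<td>(\d+)</td>"
)


def parse_proxies(html: str) -> List[str]:
    """Extract 'ip:port' strings from a proxy-list HTML table."""
    return [f"{ip}:{port}" for ip, port in PROXY_ROW.findall(html)]


def tcp_check(proxy: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the proxy can be opened."""
    host, port = proxy.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False


def usable(proxies: Iterable[str], check: Callable[[str], bool]) -> List[str]:
    """Keep only the proxies for which `check` returns True."""
    return [p for p in proxies if check(p)]


sample_html = (
    "<tr><td>192.0.2.10</td>\n<td>8080</td></tr>"
    "<tr><td>198.51.100.4</td><td>3128</td></tr>"
)
found = parse_proxies(sample_html)
```

In a crawler, `usable(found, tcp_check)` would run right before the proxies are handed out, since a list that was good a few minutes ago may already be stale.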

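Using ADSL servers to change IPs

Some small vendors resell service from regional ISPs and run small servers, typically with only 512 MB of RAM and an 8 GB disk. The upside is that these machines connect through ADSL, so the IP can be changed at any time. For example, this dynamic proxy that I set up:

ADSL

It changes IP every thirty minutes. These servers are also cheap, around 100-200 per month, so you can easily build a cluster; that way an IP has usually been rotated out before it gets banned.

Banning this kind of user is also simple. Although the IP keeps changing, it mostly stays within the same class B range (a /16), and a /16 holds only 65,536 addresses, so the site can simply block the whole range.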
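To illustrate the defender's view, the sketch below groups observed client IPs by their /16 and reports ranges from which many distinct addresses have appeared. It is only an illustration under assumed names (`slash16`, `suspicious_ranges`) and an arbitrary threshold; the addresses are documentation placeholders.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def slash16(ip: str) -> ipaddress.IPv4Network:
    """Return the /16 network that contains `ip`."""
    # strict=False lets us pass a host address and have it masked down
    return ipaddress.IPv4Network(f"{ip}/16", strict=False)


def suspicious_ranges(ips: Iterable[str], min_distinct: int = 3) -> List[str]:
    """List /16 ranges seen with at least `min_distinct` different IPs."""
    seen: Dict[ipaddress.IPv4Network, Set[str]] = defaultdict(set)
    for ip in ips:
        seen[slash16(ip)].add(ip)
    return sorted(
        str(net) for net, members in seen.items() if len(members) >= min_distinct
    )


observed = [
    "203.0.113.1", "203.0.200.9", "203.0.7.7",  # same /16, rotating host part
    "198.51.100.1",
    "203.0.113.1",                               # repeat, counted once
]
ranges = suspicious_ranges(observed, min_distinct=3)
```

Blocking a whole /16 also blocks any ordinary users in that range, which is the trade-off behind "just block the whole range".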

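Just finding a reliable ADSL provider is hard. Their technical level is generally low, and they can often offer only CentOS images; CentOS 7.1 already counts as good. One of them did offer Ubuntu 14.04, but it had all sorts of problems and cost me about half a day.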

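Using the IP-switching APIs offered by data centers

Some crawlers use elastic IPs from Alibaba Cloud or AWS to scrape data. A first anti-crawling step is to block all Alibaba Cloud IPs, since normal users generally do not browse from these addresses.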
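Checking whether a client address falls inside a known data-center range is straightforward with the standard `ipaddress` module. The sketch below uses three ranges copied from the list in the appendix; `load_networks` and `is_datacenter` are hypothetical helper names.

```python
import ipaddress
from typing import Iterable, List

# Three sample ranges taken from the appendix list
SAMPLE_RANGES = ["42.120.0.0/16", "112.124.0.0/16", "120.24.0.0/14"]


def load_networks(cidrs: Iterable[str]) -> List[ipaddress.IPv4Network]:
    """Parse CIDR strings, ignoring blank entries and stray whitespace."""
    return [ipaddress.IPv4Network(c.strip()) for c in cidrs if c.strip()]


def is_datacenter(ip: str, networks: List[ipaddress.IPv4Network]) -> bool:
    """Return True if `ip` belongs to any of the given networks."""
    addr = ipaddress.IPv4Address(ip)
    return any(addr in net for net in networks)


networks = load_networks(SAMPLE_RANGES)
hit = is_datacenter("120.26.1.1", networks)    # inside 120.24.0.0/14
miss = is_datacenter("8.8.8.8", networks)
```

A linear scan is fine for a few hundred ranges; a much larger list would call for a sorted structure or a radix tree instead.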

Appendix

Alibaba Cloud egress IP ranges (nginx deny rules, followed by a longer list of bare CIDR blocks):

deny 42.96.128.0/17;
deny 42.120.0.0/16;
deny 42.121.0.0/16;
deny 42.156.128.0/17;
deny 110.75.0.0/16;
deny 110.76.0.0/19;
deny 110.76.32.0/20;
deny 110.76.48.0/20;
deny 110.173.192.0/20;
deny 110.173.208.0/20;
deny 112.74.0.0/16;
deny 112.124.0.0/16;
deny 112.127.0.0/16;
deny 114.215.0.0/16;
deny 115.28.0.0/16;
deny 115.29.0.0/16;
deny 115.124.16.0/22;
deny 115.124.20.0/22;
deny 115.124.24.0/21;
deny 119.38.208.0/21;
deny 119.38.216.0/21;
deny 119.42.224.0/20;
deny 119.42.242.0/23;
deny 119.42.244.0/22;
deny 120.24.0.0/14;
deny 120.24.0.0/16;
deny 120.25.0.0/18;
deny 120.25.64.0/19;
deny 120.25.96.0/21;
deny 120.25.108.0/24;
deny 120.25.110.0/24;
deny 120.25.111.0/24;
deny 121.0.16.0/21;
deny 121.0.24.0/22;
deny 121.0.28.0/22;
deny 121.40.0.0/14;
deny 121.42.0.0/18;
deny 121.42.0.0/24;
deny 121.42.64.0/18;
deny 121.42.128.0/18;
deny 121.42.192.0/19;
deny 121.42.224.0/19;
deny 121.196.0.0/16;
deny 121.197.0.0/16;
deny 121.198.0.0/16;
deny 121.199.0.0/16;
deny 140.205.0.0/16;
deny 203.209.250.0/23;
deny 218.244.128.0/19;
deny 223.4.0.0/16;
deny 223.5.0.0/16;
deny 223.5.5.0/24;
deny 223.6.0.0/16;
deny 223.6.6.0/24;
deny 223.7.0.0/16;
101.200.0.0/15 
101.37.0.0/16 
101.37.0.0/17 
101.37.0.0/24 
101.37.128.0/17 
103.52.196.0/22 
103.52.196.0/23 
103.52.196.0/24 
103.52.198.0/23 
106.11.0.0/16 
106.11.0.0/17 
106.11.0.0/18 
106.11.1.0/24 
106.11.128.0/17 
106.11.32.0/22 
106.11.36.0/22 
106.11.48.0/21 
106.11.56.0/21 
106.11.64.0/19 
110.173.192.0/20 
110.173.196.0/24 
110.173.208.0/20 
110.75.0.0/16 
110.75.236.0/22 
110.75.239.0/24 
110.75.240.0/20 
110.75.242.0/24 
110.75.243.0/24 
110.75.244.0/22 
110.76.0.0/19 
110.76.21.0/24 
110.76.32.0/20 
110.76.48.0/20 
112.124.0.0/16 
112.125.0.0/16 
112.126.0.0/16 
112.127.0.0/16 
112.74.0.0/16 
112.74.0.0/17 
112.74.116.0/22 
112.74.120.0/22 
112.74.128.0/17 
112.74.32.0/19 
112.74.64.0/22 
112.74.68.0/22 
114.215.0.0/16 
114.55.0.0/16 
114.55.0.0/17 
114.55.128.0/17 
115.124.16.0/22 
115.124.20.0/22 
115.124.24.0/21 
115.28.0.0/16 
115.29.0.0/16 
118.190.0.0/16 
118.190.0.0/17 
118.190.0.0/24 
118.190.128.0/17 
118.31.0.0/16 
118.31.0.0/17 
118.31.0.0/24 
118.31.128.0/17 
119.38.208.0/21 
119.38.216.0/21 
119.38.219.0/24 
119.42.224.0/20 
119.42.242.0/23 
119.42.244.0/22 
119.42.248.0/21 
120.24.0.0/14 
120.24.0.0/15 
120.25.0.0/18 
120.25.104.0/22 
120.25.108.0/24 
120.25.110.0/24 
120.25.111.0/24 
120.25.112.0/23 
120.25.115.0/24 
120.25.136.0/22 
120.25.64.0/19 
120.25.96.0/21 
120.27.0.0/17 
120.27.128.0/17 
120.27.128.0/18 
120.27.192.0/18 
120.55.0.0/16 
120.76.0.0/15 
120.76.0.0/16 
120.77.0.0/16 
120.78.0.0/15 
121.0.16.0/21 
121.0.24.0/22 
121.0.28.0/22 
121.196.0.0/16 
121.197.0.0/16 
121.198.0.0/16 
121.199.0.0/16 
121.40.0.0/14 
121.42.0.0/18 
121.42.0.0/24 
121.42.128.0/18 
121.42.17.0/24 
121.42.192.0/19 
121.42.224.0/19 
121.42.64.0/18 
123.56.0.0/15 
123.56.0.0/16 
123.57.0.0/16 
139.129.0.0/16 
139.129.0.0/17 
139.129.128.0/17 
139.196.0.0/16 
139.196.0.0/17 
139.196.128.0/17 
139.224.0.0/16 
139.224.0.0/17 
139.224.128.0/17 
140.205.0.0/16 
140.205.128.0/18 
140.205.192.0/18 
140.205.32.0/19 
140.205.76.0/24 
182.92.0.0/16 
203.107.0.0/24 
203.107.1.0/24 
203.209.224.0/19 
218.244.128.0/19 
223.4.0.0/16 
223.5.0.0/16 
223.5.5.0/24 
223.6.0.0/16 
223.6.6.0/24 
223.7.0.0/16 
39.100.0.0/14 
39.104.0.0/14 
39.104.0.0/15 
39.104.0.0/24 
39.106.0.0/15 
39.108.0.0/16 
39.108.0.0/17 
39.108.0.0/24 
39.108.128.0/17 
39.96.0.0/13 
39.96.0.0/14 
39.96.0.0/24 
42.120.0.0/16 
42.121.0.0/16 
42.156.128.0/17 
42.96.128.0/17 
45.113.40.0/22 
45.113.40.0/23 
45.113.40.0/24 
45.113.42.0/23 
47.92.0.0/14 
47.92.0.0/15 
47.92.0.0/24 
47.94.0.0/15
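The two lists above use different formats and contain overlapping entries (for example, 120.24.0.0/14 already covers 120.24.0.0/16). The sketch below shows one way to normalize such input into a deduplicated set of nginx `deny` rules using `ipaddress.collapse_addresses`; the function name `to_deny_rules` is made up for this example.

```python
import ipaddress
import re
from typing import Iterable, List

# Matches an IPv4 CIDR block anywhere in a line
CIDR = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}/\d{1,2}")


def to_deny_rules(lines: Iterable[str]) -> List[str]:
    """Turn 'deny a.b.c.d/n;' or bare 'a.b.c.d/n' lines into merged rules."""
    nets = []
    for line in lines:
        match = CIDR.search(line)
        if match:
            nets.append(ipaddress.IPv4Network(match.group(0)))
    # collapse_addresses drops duplicates and contained subnets and merges
    # adjacent blocks; the result comes back sorted
    return [f"deny {net};" for net in ipaddress.collapse_addresses(nets)]


rules = to_deny_rules([
    "deny 120.24.0.0/14;",
    "deny 120.24.0.0/16;",
    "120.25.0.0/18 ",
    "101.37.0.0/17 ",
    "101.37.128.0/17 ",
    "",
])
```

Because adjacent blocks are merged, the output can contain ranges that appear in neither input list (here the two /17 blocks become a single /16) while covering exactly the same addresses.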

squid proxy

Building a proxy cluster with squid

Install squid

Plain old apt-get:

apt-get update && apt-get install squid3 apache2-utils -y

Basic squid conf

Put the following in /etc/squid3/squid.conf, replacing the super bloated default config file:

# note that on ubuntu 16.04, use squid instead of squid3
auth_param basic program /usr/lib/squid3/basic_ncsa_auth /etc/squid3/passwords
auth_param basic realm proxy
acl authenticated proxy_auth REQUIRED
http_access allow authenticated
forwarded_for delete
http_port 0.0.0.0:3128

Note that the helper program is basic_ncsa_auth, not the old ncsa_auth.

Setting up a user

Create the password file (on Ubuntu 16.04, use squid in place of squid3):

sudo htpasswd -c /etc/squid3/passwords username_you_like

Enter a password twice for the chosen username, then restart squid:

sudo service squid3 restart

see: https://stackoverflow.com/questions/3297196/how-to-set-up-a-squid-proxy-with-basic-username-and-password-authentication

On CentOS

I have to use CentOS, since the ADSL providers cannot provide Ubuntu.

check out this wonderful article: https://hostpresto.com/community/tutorials/how-to-install-and-configure-squid-proxy-on-centos-7/

yum install -y epel-release
yum install -y squid
yum install -y httpd-tools
systemctl start squid
systemctl enable squid
touch /etc/squid/passwd && chown squid /etc/squid/passwd
htpasswd -c /etc/squid/passwd root

Edit /etc/squid/squid.conf:

auth_param basic program /usr/lib64/squid/basic_ncsa_auth /etc/squid/passwd
auth_param basic children 5
auth_param basic realm Squid Basic Authentication
auth_param basic credentialsttl 2 hours
acl auth_users proxy_auth REQUIRED
http_access allow auth_users
http_port 3128
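With basic authentication enabled, a client has to send credentials to the proxy. The sketch below shows one way to do this from Python's standard library; the helper names `proxy_url` and `make_opener`, the host 192.0.2.20 and the password are placeholders, while the user `root` and port 3128 follow the config above.

```python
import urllib.request
from urllib.parse import quote


def proxy_url(user: str, password: str, host: str, port: int = 3128) -> str:
    """Build a proxy URL, percent-encoding special characters in credentials."""
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


def make_opener(url: str) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS requests through the proxy."""
    handler = urllib.request.ProxyHandler({"http": url, "https": url})
    return urllib.request.build_opener(handler)


# Placeholder host and password; substitute the real squid machine
url = proxy_url("root", "p@ss:word", "192.0.2.20")
opener = make_opener(url)
# opener.open("http://example.com/", timeout=10) would go through the proxy
```

Percent-encoding matters when the password contains characters such as `@` or `:`, which would otherwise break the URL.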

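A small gotcha

By default squid only proxies HTTPS traffic to port 443 and rejects CONNECT requests to any other port. The config file needs to be changed.

To fix this, add your port to this line in the config file:
acl SSL_ports port 443
so it becomes
acl SSL_ports port 443 4444
This is the default rule that denies CONNECT to every port other than 443:
http_access deny CONNECT !SSL_ports # delete this line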