p0's blog | Focused on network security
Implementing a Simple Distributed Crawler with Celery + Redis
Category: Python | Comments: 0 | Views: 1649

Celery official documentation: http://docs.jinkan.org/docs/celery/
This post does not cover setting up the Celery + Redis environment; see the official documentation for that.

The first program:

tasks.py:
from celery import Celery

# Broker only; no result backend is configured here
app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def add(x, y):
    return x + y
add.py:
from tasks import add

def notify(a, b):
    # .delay() sends the task to the broker and returns an AsyncResult
    result = add.delay(a, b)
    return result

if __name__ == '__main__':
    print(notify(6, 7))

In the first terminal, run redis-server.

In the second terminal, run celery -A tasks worker --loglevel=info; the output shows that the Celery worker started successfully.

In the third terminal, run the script: python add.py

Note: to avoid import errors, run all of these commands from the directory containing the files above.
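Note that tasks.py above configures only a broker, so reading the task's return value with result.get() would fail: there is no result backend. A minimal sketch of a backend-enabled variant (assuming Redis database 0 on the default port, as used later in this post):

```python
from celery import Celery

# Hypothetical variant of tasks.py with a result backend configured,
# so AsyncResult.get() can fetch the computed value.
app = Celery(
    'tasks',
    broker='redis://localhost:6379/0',
    backend='redis://localhost:6379/0',
)

@app.task
def add(x, y):
    return x + y
```

With this configuration, add.delay(6, 7).get() returns the result once a worker has processed the task.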

Distributed crawling of web images:

The main implementation code:
celery_config.py (Celery configuration)
from celery import Celery

app = Celery('celery_config',
             backend='redis://localhost:6379/0',
             broker='redis://localhost:6379/0',
             include=['spider'])
spider.py (the distributed task functions; start the worker for this app with celery -A celery_config worker --loglevel=info)
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re

from celery_config import app

# Build the list of listing-page URLs
@app.task
def getPageUrl():
    urllist = []
    for i in range(1, 140):
        url = "http://www.umei.cc/p/gaoqing/rihan/" + str(i) + ".htm"
        urllist.append(url)
    return urllist

# Extract the gallery URLs from a listing page
@app.task
def getUrl(html):
    reg = r'/p/gaoqing/rihan/2016.*?\.htm'
    urlre = re.compile(reg)
    urllist = urlre.findall(html)
    # Deduplicate; return a list, since sets are not JSON-serializable
    return list(set(urllist))

# Extract the image URLs from a gallery page
@app.task
def getImg(html):
    reg = r'http://i1.umei.cc.*?\.jpg'
    imgre = re.compile(reg)
    imglist = imgre.findall(html)
    return list(set(imglist))  # deduplicate
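The two extraction patterns can be exercised locally, without Celery or the network. A minimal sketch against a made-up HTML snippet (the markup and URLs below are invented for illustration, mimicking the shapes the tasks expect):

```python
import re

# Invented sample markup mimicking a listing page and a gallery page
html = '''
<a href="/p/gaoqing/rihan/2016abc.htm">gallery</a>
<a href="/p/gaoqing/rihan/2016abc.htm">duplicate link</a>
<img src="http://i1.umei.cc/pic/001.jpg">
'''

# Same patterns and dedup step as in spider.py
gallery_urls = list(set(re.findall(r'/p/gaoqing/rihan/2016.*?\.htm', html)))
img_urls = list(set(re.findall(r'http://i1.umei.cc.*?\.jpg', html)))

print(gallery_urls)
print(img_urls)
```

The duplicate link collapses to a single entry after the set() pass, which is exactly why the tasks deduplicate before returning.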
run.py (the driver script that calls the tasks; runs on the machine hosting the Redis server)
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib.request

import requests

from spider import getPageUrl, getUrl, getImg

x = 0

# Driver: fetch pages locally, dispatch the parsing to Celery workers
def Run():
    global x
    pageurls = getPageUrl.delay().get()
    for pageurl in pageurls:
        html = requests.get(pageurl).text
        urllist = getUrl.delay(html).get()
        for url in urllist:
            url = 'http://umei.cc' + url
            print(url)
            ImgHtml = requests.get(url).text
            imglist = getImg.delay(ImgHtml).get()
            for imgurl in imglist:
                print(imgurl)
                urllib.request.urlretrieve(imgurl, '/test/%s.jpg' % x)
                x += 1

if __name__ == '__main__':
    Run()
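The string concatenation 'http://umei.cc' + url works only because getUrl returns root-relative paths; the standard library's urljoin is a more robust way to build absolute URLs. A small illustration (the relative path below is a made-up example in the site's URL shape):

```python
from urllib.parse import urljoin

base = 'http://www.umei.cc'
# Hypothetical root-relative path, in the shape getUrl returns
relative = '/p/gaoqing/rihan/2016abc.htm'

# urljoin handles leading slashes, so no manual concatenation is needed
full = urljoin(base, relative)
print(full)
```

Unlike plain concatenation, this also behaves sensibly if a matched path ever lacks or doubles the leading slash.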

As for how it all runs, a picture is worth a thousand words:

[Figure: program flow diagram (程序框图.png)]

Adding multithreading to the distributed crawler:

Only run.py changes:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import threading
import urllib.request

import requests

from spider import getPageUrl, getUrl, getImg

x = 0
lock = threading.Lock()  # protects the shared file counter

# Process one group of listing pages in a single thread
def Run(urlgroup):
    global x
    for pageurl in urlgroup:
        html = requests.get(pageurl).text
        urllist = getUrl.delay(html).get()
        for url in urllist:
            url = 'http://umei.cc' + url
            print(url)
            ImgHtml = requests.get(url).text
            imglist = getImg.delay(ImgHtml).get()
            for imgurl in imglist:
                print(imgurl)
                with lock:  # serialize access to the counter
                    filename = '/test/%s.jpg' % x
                    x += 1
                urllib.request.urlretrieve(imgurl, filename)

# Add multithreading: split the pages into groups of two,
# one thread per group
def main():
    pageurls = getPageUrl.delay().get()
    threads = []
    for i in range(0, len(pageurls), 2):
        urlgroup = pageurls[i:i + 2]
        t = threading.Thread(target=Run, args=(urlgroup,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

if __name__ == '__main__':
    main()
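The hand-rolled grouping and thread bookkeeping above can also be expressed with the standard library's thread pool. A minimal sketch with a stand-in worker function (process_page below is a placeholder for illustration, not the crawler's real fetch-and-parse code):

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder standing in for the real fetch-and-parse work
def process_page(url):
    return len(url)

pageurls = ['http://www.umei.cc/p/gaoqing/rihan/%d.htm' % i
            for i in range(1, 5)]

# The pool replaces the manual grouping: two worker threads pull
# pages from the list until it is exhausted, and results come back
# in input order
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(process_page, pageurls))

print(results)
```

With a pool there is no leftover "odd group" to handle, and the with-block joins all threads automatically.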
  • Corrections are welcome if you spot any mistakes
  • Please credit the source when reposting

Copyright belongs to the author.
For commercial reposting, please contact the author for permission; for non-commercial reposting, please credit the source.
Author: p0
Link: http://p0sec.net/index.php/archives/58/
Source: http://p0sec.net/
