GeorgeYang's Blog

My technology blog

A Python program that randomly picks crawler .py files to scrape data

Views: 252 Created: 16-06-11 11:15:42 Tags: python, crawler

I recently built a crawler program that scrapes data from several websites, and it had to support adding new crawler .py files without restarting the process. The approach:

1. Use the `__import__` function to import modules dynamically.
2. Use the `getattr` function for reflection, loading methods dynamically by name.
3. Every crawler file implements a method with a fixed name; the caller passes in a class instance, and the crawler writes its data out through a method on that instance.
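The two building blocks are standard Python. A minimal sketch, using the stdlib `json` module purely as a stand-in for a crawler module:

```python
import importlib

# Step 1: import a module dynamically from its name as a string
# (importlib.import_module is the modern equivalent of __import__ with a fromlist).
mod = importlib.import_module("json")

# Step 2: reflection — fetch an attribute by its string name, then call it.
func = getattr(mod, "dumps")
print(func({"a": 1}))  # {"a": 1}
```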

Preparation:

Create a folder under the project directory to hold the crawler .py files. The manager program lists all the .py files with the following method:

 import os
 import re

 def getEffectFile():
     effectFiles = []
     path = os.path.split(os.path.realpath(__file__))[0]
     pattern = re.compile(r'^(?!_).*\.py$')  # .py files only, skipping names that start with "_"
     for filename in os.listdir(path + '/scanner'):  # the crawler directory
         match = pattern.match(filename)
         if match:
             effectFiles.append(filename)
     return effectFiles
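To sanity-check the filter, the same pattern can be run against a throwaway directory (the file names below are invented for the demo):

```python
import os
import re
import tempfile

pattern = re.compile(r'^(?!_).*\.py$')  # keep .py files, skip names starting with "_"

with tempfile.TemporaryDirectory() as path:
    # Create one crawler file, one private file, and one non-Python file.
    for name in ('csdnhot.py', '_init_.py', 'notes.txt'):
        open(os.path.join(path, name), 'w').close()
    files = [f for f in os.listdir(path) if pattern.match(f)]

print(files)  # ['csdnhot.py']
```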

Implementing steps 1 and 2:

 class Writter:
     def __init__(self, mongdb):
         self.mongdb = mongdb

     def write(self, name, url, picture):
         # write the record to the database
         pass


 import traceback

 def runPy(fileName, db):
     modelName = fileName
     controllerName = "startScan"  # the fixed method name every crawler implements

     try:
         moduleList = __import__('scanner.' + modelName, {}, {}, [modelName])
     except Exception:
         print("module not found:", modelName)
         return
     modelObj = getattr(moduleList, modelName)()

     if hasattr(modelObj, controllerName):
         try:
             result = getattr(modelObj, controllerName)(Writter(db))
             print("result:", result)
         except Exception:
             exstr = traceback.format_exc()
             print(exstr)
     else:
         print("method not found:", controllerName)

Finally, the main program:

 import random
 import time

 from pymongo import MongoClient

 effectFiles = []
 db = None
 while True:
     if db is None:
         db = MongoClient('127.0.0.1', 1234)
     else:
         try:
             db.admin.command('ping')  # check the connection is still alive
         except Exception:
             db = MongoClient('127.0.0.1', 1234)
     flag = random.randint(1, 10)  # a 1-in-10 chance to re-list the crawler files
     if effectFiles is None or len(effectFiles) == 0 or flag == 10:
         effectFiles = getEffectFile()
     print(effectFiles)
     # rIndex = random.randint(0, len(effectFiles)-1)
     fileName = random.choice(effectFiles)  # simpler than indexing with randint
     runPy(fileName[:-3], db)  # strip the ".py" suffix
     time.sleep(3600)
 db.close()  # never reached: the loop above runs forever

An example crawler file:

 /scanner/csdnhot.py

 class csdnhot:
     def startScan(self, writer):
         name = ...
         url = ...
         picture = ...
         writer.write(name, url, picture)

All done! Happy crawling :)