This article is about 3,518 characters; reading it takes roughly 11 minutes.
It feels like a long time since I last wrote a crawler. Today I browsed Bilibili for a while and noticed there is plenty worth scraping, such as the homepage ranking list. The anime (番剧) section looked like an easy data source to locate, so I used the homepage anime listing for practice.
The page to scrape:
https://www.bilibili.com/anime/index/#season_version=-1&area=-1&is_finish=-1&copyright=-1&season_status=-1&season_month=-1&year=-1&style_id=-1&order=3&st=1&sort=0&page=1

By observing the URL pattern and stripping the parameters that do not affect the request, I arrived at the underlying API URL https://api.bilibili.com/pgc/season/index//result?page=1&season_type=1&pagesize=20&type=1. From there, changing only the page value returns each batch of results; the maximum page value is 153. The scraped data is probably not very useful, but I wrote the code anyway.

First, a main entry point that runs Scrapy programmatically, so there is no need to type scrapy crawl name every time:
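The per-page URLs can be generated from that pattern; a minimal sketch (the endpoint and query parameters are copied from the API URL above, and the page cap of 153 is the value observed above):

```python
# Build one API URL per page; the API returns 20 entries per page.
BASE = ('https://api.bilibili.com/pgc/season/index//result'
        '?page=%d&season_type=1&pagesize=20&type=1')

urls = [BASE % page for page in range(1, 154)]  # pages 1..153

print(len(urls))  # 153
print(urls[0])
```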
```python
# -*- coding: utf-8 -*-
# @Project filename: PythonDemo dramaMain.py
# @IDE    : IntelliJ IDEA
# @Author : ganxiang
# @Date   : 2020/03/02 0002 19:16
import os
import sys

from scrapy.cmdline import execute

# Make sure the project root is on sys.path, then launch the spider.
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(['scrapy', 'crawl', 'drama'])
```
The spider, dramaSeries.py:
```python
# -*- coding: utf-8 -*-
import json

import scrapy

from ..items import DramaseriesItem


class DramaSpider(scrapy.Spider):
    name = 'drama'
    # allowed_domains should hold bare domains, not full URLs.
    allowed_domains = ['api.bilibili.com']
    i = 1  # running index across all pages
    start_urls = [
        'https://api.bilibili.com/pgc/season/index//result'
        '?page=%s&season_type=1&pagesize=20&type=1' % s
        for s in range(1, 101)
    ]

    def parse(self, response):
        drama = json.loads(response.text)
        data_list = drama['data']['list']
        for field in data_list:
            # Create a fresh item per entry so the yielded items
            # do not share state.
            item = DramaseriesItem()
            item['number'] = self.i
            item['badge'] = field['badge']
            item['cover_img'] = field['cover']
            item['index_show'] = field['index_show']
            item['link'] = field['link']
            item['media_id'] = field['media_id']
            item['order_type'] = field['order_type']
            item['season_id'] = field['season_id']
            item['title'] = field['title']
            self.i += 1
            yield item
```
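To make the parsing step concrete, here is a standalone sketch of the same `json.loads` → `data` → `list` traversal, run against a hand-made sample payload (the sample values are invented; only the field names follow the API response used above):

```python
import json

# A hand-made sample mimicking the API's response shape (values invented).
sample = json.dumps({
    "data": {
        "list": [
            {"badge": "VIP", "cover": "http://example/cover.jpg",
             "index_show": "12 episodes", "link": "http://example/play",
             "media_id": 1, "order_type": "follow", "season_id": 2,
             "title": "Some Anime"}
        ]
    }
})

# The spider does exactly this traversal on response.text.
data_list = json.loads(sample)["data"]["list"]
titles = [entry["title"] for entry in data_list]
print(titles)  # ['Some Anime']
```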
items.py
```python
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class DramaseriesItem(scrapy.Item):
    number = scrapy.Field()
    badge = scrapy.Field()
    cover_img = scrapy.Field()
    index_show = scrapy.Field()
    link = scrapy.Field()
    media_id = scrapy.Field()
    order_type = scrapy.Field()
    season_id = scrapy.Field()
    title = scrapy.Field()
```
pipelines.py
```python
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from openpyxl import Workbook


class DramaseriesPipeline(object):
    excelBook = Workbook()
    activeSheet = excelBook.active
    header = ['number', 'title', 'link', 'media_id', 'season_id',
              'index_show', 'cover_img', 'badge']
    activeSheet.append(header)

    def process_item(self, item, spider):
        row = [item['number'], item['title'], item['link'],
               item['media_id'], item['season_id'], item['index_show'],
               item['cover_img'], item['badge']]
        self.activeSheet.append(row)
        # Saving on every item is simple but slow; fine at this scale.
        self.excelBook.save('./drama.xlsx')
        return item
```
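The pipeline leans on two openpyxl calls, `Worksheet.append` and `Workbook.save`; a quick standalone round-trip of that pattern (the file name `demo.xlsx` is just for illustration):

```python
from openpyxl import Workbook, load_workbook

wb = Workbook()
ws = wb.active
ws.append(['number', 'title'])   # header row, as in the pipeline
ws.append([1, 'Some Anime'])     # one data row
wb.save('demo.xlsx')

# Read the file back to confirm what was written.
rows = list(load_workbook('demo.xlsx').active.values)
print(rows)  # [('number', 'title'), (1, 'Some Anime')]
```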
settings.py
Enable the following in settings.py:

```python
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

ITEM_PIPELINES = {
    'dramaSeries.pipelines.DramaseriesPipeline': 300,
}
```
The run scraped over two thousand records; there is plenty more that could still be collected.