This article is about 3,518 characters; reading it takes roughly 11 minutes.
It feels like a long time since I last wrote a crawler. Today I browsed Bilibili for a while and noticed there is plenty worth scraping, such as the homepage ranking list. The anime (番剧) section looked like an easy data source to locate, so I used the homepage anime listing for practice.
The page to scrape:
https://www.bilibili.com/anime/index/#season_version=-1&area=-1&is_finish=-1&copyright=-1&season_status=-1&season_month=-1&year=-1&style_id=-1&order=3&st=1&sort=0&page=1

By observing the URL pattern and stripping the parameters that do not affect the request, I arrived at the underlying API URL https://api.bilibili.com/pgc/season/index//result?page=1&season_type=1&pagesize=20&type=1. From there, changing only the page value returns each batch of results; the maximum page value is 153. The scraped data is probably not very useful, but I wrote the code anyway.

First, a main entry point that runs Scrapy programmatically, so there is no need to type scrapy crawl name every time:
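The per-page URLs can be generated from that pattern; a minimal sketch (the endpoint and query parameters are copied from the API URL above, and the page cap of 153 is the value observed above):

```python
# Build one API URL per page; the API returns 20 entries per page.
BASE = ('https://api.bilibili.com/pgc/season/index//result'
        '?page=%d&season_type=1&pagesize=20&type=1')

urls = [BASE % page for page in range(1, 154)]  # pages 1..153

print(len(urls))  # 153
print(urls[0])
```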
```python
# -*- coding: utf-8 -*-
# @Project filename: PythonDemo dramaMain.py
# @IDE    : IntelliJ IDEA
# @Author : ganxiang
# @Date   : 2020/03/02 0002 19:16
import os
import sys

from scrapy.cmdline import execute

# Make sure the project root is on sys.path, then launch the spider.
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(['scrapy', 'crawl', 'drama'])
```
The spider, dramaSeries.py:
```python
# -*- coding: utf-8 -*-
import json

import scrapy

from ..items import DramaseriesItem


class DramaSpider(scrapy.Spider):
    name = 'drama'
    # allowed_domains should hold bare domains, not full URLs.
    allowed_domains = ['api.bilibili.com']
    i = 1  # running index across all pages
    start_urls = [
        'https://api.bilibili.com/pgc/season/index//result'
        '?page=%s&season_type=1&pagesize=20&type=1' % s
        for s in range(1, 101)
    ]

    def parse(self, response):
        drama = json.loads(response.text)
        data_list = drama['data']['list']
        for field in data_list:
            # Create a fresh item per entry so the yielded items
            # do not share state.
            item = DramaseriesItem()
            item['number'] = self.i
            item['badge'] = field['badge']
            item['cover_img'] = field['cover']
            item['index_show'] = field['index_show']
            item['link'] = field['link']
            item['media_id'] = field['media_id']
            item['order_type'] = field['order_type']
            item['season_id'] = field['season_id']
            item['title'] = field['title']
            self.i += 1
            yield item
```
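To make the parsing step concrete, here is a standalone sketch of the same `json.loads` → `data` → `list` traversal, run against a hand-made sample payload (the sample values are invented; only the field names follow the API response used above):

```python
import json

# A hand-made sample mimicking the API's response shape (values invented).
sample = json.dumps({
    "data": {
        "list": [
            {"badge": "VIP", "cover": "http://example/cover.jpg",
             "index_show": "12 episodes", "link": "http://example/play",
             "media_id": 1, "order_type": "follow", "season_id": 2,
             "title": "Some Anime"}
        ]
    }
})

# The spider does exactly this traversal on response.text.
data_list = json.loads(sample)["data"]["list"]
titles = [entry["title"] for entry in data_list]
print(titles)  # ['Some Anime']
```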
items.py
```python
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class DramaseriesItem(scrapy.Item):
    number = scrapy.Field()
    badge = scrapy.Field()
    cover_img = scrapy.Field()
    index_show = scrapy.Field()
    link = scrapy.Field()
    media_id = scrapy.Field()
    order_type = scrapy.Field()
    season_id = scrapy.Field()
    title = scrapy.Field()
```
pipelines.py
```python
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from openpyxl import Workbook


class DramaseriesPipeline(object):
    excelBook = Workbook()
    activeSheet = excelBook.active
    header = ['number', 'title', 'link', 'media_id', 'season_id',
              'index_show', 'cover_img', 'badge']
    activeSheet.append(header)

    def process_item(self, item, spider):
        row = [item['number'], item['title'], item['link'],
               item['media_id'], item['season_id'], item['index_show'],
               item['cover_img'], item['badge']]
        self.activeSheet.append(row)
        # Saving on every item is simple but slow; fine at this scale.
        self.excelBook.save('./drama.xlsx')
        return item
```
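The pipeline leans on two openpyxl calls, `Worksheet.append` and `Workbook.save`; a quick standalone round-trip of that pattern (the file name `demo.xlsx` is just for illustration):

```python
from openpyxl import Workbook, load_workbook

wb = Workbook()
ws = wb.active
ws.append(['number', 'title'])   # header row, as in the pipeline
ws.append([1, 'Some Anime'])     # one data row
wb.save('demo.xlsx')

# Read the file back to confirm what was written.
rows = list(load_workbook('demo.xlsx').active.values)
print(rows)  # [('number', 'title'), (1, 'Some Anime')]
```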
settings.py
Enable the following in settings.py:

```python
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

ITEM_PIPELINES = {
    'dramaSeries.pipelines.DramaseriesPipeline': 300,
}
```
The run scraped over two thousand records; there is plenty more that could still be collected.