Novel Download Script (Writing a Novel Script)

Submitted by a user 385 2022-09-05



================================

Tool preparation:

================================

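Download the chromedriver build that matches your Chrome browser version (a chromedriver download mirror in China is available).

Copy chromedriver.exe into Python's Scripts directory, for example C:\Anaconda3\Scripts\

Then add C:\Anaconda3\Scripts\ to the Windows PATH environment variable.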
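A quick way to confirm that the PATH change took effect is to ask Python where it finds the executable. This is a minimal sketch that uses only the standard library; the helper name `find_driver` is made up for illustration.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable found on PATH, or None."""
    # On Windows, shutil.which also tries the extensions in PATHEXT (such as .exe).
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path:
        print("chromedriver found at:", path)
    else:
        print("chromedriver is not on PATH; check the Scripts directory and the PATH setting")
```

Open a new terminal before running the check, because a terminal that was already open keeps the old PATH.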

================================

Install the selenium Python package

================================

pip install selenium

================================

More about selenium

================================

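selenium supports not only Python but also Java and C#.

The scripts in this post were improved step by step, so the last download script is the most general and most polished one.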

================================

Derive each chapter's URL from its chapter number, then download

================================

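from selenium import webdriver
from selenium.webdriver.common.by import By

web = webdriver.Chrome()
full_text = "小说:穿越种田之将门妻"
full_text = full_text + "\n" + "\n" + "\n"

# The base URL and the chapter range were lost from the original post; only the
# fragment "#39" survives. Set home_url, chapter_start and chapter_end before running.
home_url = ""
start_page_id = 0

for i in range(chapter_start, chapter_end + 1):
    # The page id is the chapter number plus a fixed offset.
    page_id = i + start_page_id
    url = home_url + str(page_id) + ".html"
    # print("第" + str(i) + "章")
    full_text = full_text + "\n" + "\n" + "\n" + "======================" + "\n" + "第" + str(i) + "章" + "\n"
    web.get(url)

    # find_element_by_id was removed in Selenium 4.3; use find_element(By.ID, ...) instead.
    content_tag = web.find_element(By.ID, "content")
    # content_tag = web.find_element(By.CLASS_NAME, "panel panel-default panel-readcontent")
    content = content_tag.text
    full_text = full_text + content

print(full_text)
web.close()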
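The URL arithmetic in the loop above can be checked without opening a browser. The sketch below isolates it in a small function; `build_chapter_urls` and the `https://example.com/book/` prefix are hypothetical and only illustrate the "prefix + (chapter number + offset) + .html" pattern the script relies on.

```python
def build_chapter_urls(home_url, chapter_start, chapter_end, start_page_id=0):
    """Map each chapter number to its page URL: home_url + (chapter + offset) + '.html'."""
    urls = {}
    for chapter in range(chapter_start, chapter_end + 1):
        page_id = chapter + start_page_id
        urls[chapter] = home_url + str(page_id) + ".html"
    return urls


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    for chapter, url in build_chapter_urls("https://example.com/book/", 1, 3, start_page_id=100).items():
        print(chapter, url)
```

Printing the first and last URL and opening them by hand is a cheap way to confirm the offset before starting a long download.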

================================

Extract each chapter's URL from the list page, then download the chapter text

================================

#========================================
# Method 1: digits to Chinese. Flawed: for example, 10 becomes 一零
#========================================
def num_to_char(num):
    """Convert a number to Chinese, digit by digit."""
    num = str(num)
    new_str = ""
    num_dict = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
                "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}
    listnum = list(num)
    # print(listnum)
    shu = []
    for i in listnum:
        # print(num_dict[i])
        shu.append(num_dict[i])
    new_str = "".join(shu)
    # print(new_str)
    return new_str


#========================================
# Method 2: number to Chinese, much more complete
#========================================
# -------------------------------------------------------------------------------
# Name: num2chinese
# Author: yunhgu
# Date: 2021/8/24 14:51
# Description:
# -------------------------------------------------------------------------------
_MAPPING = ('零', '一', '二', '三', '四', '五', '六', '七', '八', '九',)
_P0 = ('', '十', '百', '千',)
_S4, _S8, _S16 = 10 ** 4, 10 ** 8, 10 ** 16
_MIN, _MAX = 0, 9999999999999999


class NotIntegerError(Exception):
    pass


class OutOfRangeError(Exception):
    pass


class Num2Chinese:
    def convert(self, number: int):
        """
        :param number:
        :return: chinese number
        """
        return self._to_chinese(number)

    def _to_chinese(self, num):
        if not str(num).isdigit():
            raise NotIntegerError('%s is not an integer.' % num)
        if num < _MIN or num > _MAX:
            raise OutOfRangeError('%d out of range[%d, %d]' % (num, _MIN, _MAX))
        if num < _S4:
            return self._to_chinese4(num)
        elif num < _S8:
            return self._to_chinese8(num)
        else:
            return self._to_chinese16(num)

    @staticmethod
    def _to_chinese4(num):
        assert (0 <= num < _S4)
        if num < 10:
            return _MAPPING[num]
        else:
            # Collect the digits from least to most significant.
            lst = []
            while num >= 10:
                lst.append(num % 10)
                num = num // 10
            lst.append(num)
            c = len(lst)  # number of digits
            result = ''
            for idx, val in enumerate(lst):
                if val != 0:
                    result += _P0[idx] + _MAPPING[val]
                    if idx < c - 1 and lst[idx + 1] == 0:
                        result += '零'
            # The string was built backwards, so reverse it.
            # Note: the replace also turns 一百一十 into 一百十.
            return result[::-1].replace('一十', '十')

    def _to_chinese8(self, num):
        assert (num < _S8)
        to4 = self._to_chinese4
        if num < _S4:
            return to4(num)
        else:
            mod = _S4
            high, low = num // mod, num % mod
            if low == 0:
                return to4(high) + '万'
            else:
                if low < _S4 // 10:
                    return to4(high) + '万零' + to4(low)
                else:
                    return to4(high) + '万' + to4(low)

    def _to_chinese16(self, num):
        assert (num < _S16)
        to8 = self._to_chinese8
        mod = _S8
        high, low = num // mod, num % mod
        if low == 0:
            return to8(high) + '亿'
        else:
            if low < _S8 // 10:
                return to8(high) + '亿零' + to8(low)
            else:
                return to8(high) + '亿' + to8(low)


#========================================
# Extract each chapter's URL from the list page, then download the chapter text
#========================================
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

web = webdriver.Chrome()
num2chinese = Num2Chinese()
full_text = "小说:掌家小娘子"
full_text = full_text + "\n" + "\n" + "\n"
print(full_text)

# The list page URL and the chapter range were lost from the original post; only the
# fragment "#306" survives. Set list_url, chapter_start and chapter_end before running.
list_url = ""

for i in range(chapter_start, chapter_end + 1):
    chinese_chapter_id = num2chinese.convert(i)  # Chinese numerals
    # chinese_chapter_id = str(i)  # Arabic numerals
    chinese_chapter_name = "第" + chinese_chapter_id + "章"
    # str.find returns -1 (truthy) when nothing is found, so test membership instead.
    if "百十" in chinese_chapter_name:
        chinese_chapter_name = chinese_chapter_name.replace("百十", "百一十")
    # print(chinese_chapter_name)
    web.get(list_url)  # go back to the list page to pick up the chapter's URL
    url = ""
    try:
        url = web.find_element(By.PARTIAL_LINK_TEXT, chinese_chapter_name).get_attribute("href")
    except NoSuchElementException:
        url = ""
    # print(url)
    if url:
        web.get(url)
        # Alternative locators tried on other sites:
        # //*[@id="content"]
        # content_tag = web.find_elements(By.CSS_SELECTOR, "dd")[2]
        # content_tag = web.find_element(By.ID, "contents")
        # content_tag = web.find_element(By.CLASS_NAME, "container body-content")
        content_tag = web.find_element(By.XPATH, '''//*[@id="center"]''')
        content = content_tag.text
    else:
        content = "不提供下载"  # "not available for download"
    chapter_text = "\n" + "\n" + "\n" + "======================" + "\n" + "第" + str(i) + "章" + "\n"
    chapter_text = chapter_text + content
    print(chapter_text)
    full_text = full_text + chapter_text

# print(full_text)
web.close()
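The script above patches "百十" back to "百一十" after conversion, because method 2 shortens every "一十" to "十". An alternative is to shorten only a leading "一十" (the numbers 10 to 19). The sketch below does that for chapter numbers from 1 to 9999; the names `chapter_number_to_chinese` and `chapter_title` are made up for illustration and are not part of the original num2chinese code.

```python
DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]


def chapter_number_to_chinese(n):
    """Convert an integer in 1..9999 to Chinese numerals."""
    if not 1 <= n <= 9999:
        raise ValueError("only 1..9999 is supported in this sketch")
    digits = [int(d) for d in str(n)]
    parts = []
    pending_zero = False
    for pos, d in enumerate(digits):
        unit = UNITS[len(digits) - 1 - pos]
        if d == 0:
            # Remember the gap; a single 零 is emitted before the next non-zero digit.
            pending_zero = True
            continue
        if pending_zero and parts:
            parts.append("零")
        pending_zero = False
        parts.append(DIGITS[d] + unit)
    text = "".join(parts)
    # Only a leading 一十 is shortened (10..19); 一百一十 keeps its 一.
    if text.startswith("一十"):
        text = text[1:]
    return text


def chapter_title(n):
    """Build the link text used on the list page, such as 第十二章."""
    return "第" + chapter_number_to_chinese(n) + "章"
```

Whether a site writes 一百一十 or 一百十 in its chapter links varies, so it is worth checking one real link on the list page before choosing either form.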

================================

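Each chapter may span multiple pages

Performance optimized

Output written to a file automatically

Side-story (extra) chapters can be downloaded as well

Code logic cleaned up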

================================
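The code for this version is not included in the post, so the following is only a rough sketch of two of the listed features: following a chapter across several pages, and writing the result to a file. The function names, the `fetch_page` callable and its `(text, next_url)` return shape are all assumptions made for illustration; in a real script `fetch_page` would wrap `web.get(url)` and the element lookups.

```python
def download_chapter(fetch_page, first_url, max_pages=20):
    """Collect the text of one chapter that may be split across several pages.

    fetch_page(url) is a hypothetical callable that returns (text, next_url);
    next_url is None when the chapter has no further page.
    """
    parts = []
    seen = set()
    url = first_url
    # Stop on a missing next link, on a link loop, or after max_pages pages.
    while url and url not in seen and len(parts) < max_pages:
        seen.add(url)
        text, url = fetch_page(url)
        parts.append(text)
    return "\n".join(parts)


def save_novel(path, title, chapters):
    """Write the title and an iterable of (heading, text) pairs to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(title + "\n\n\n")
        for heading, text in chapters:
            f.write("\n\n\n======================\n" + heading + "\n")
            f.write(text)
```

Writing each chapter as it arrives avoids growing one very large string in memory, and the downloaded part of the novel survives if a later page fails.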

================================

Handle pages where the xpath of the main text is not fixed

================================
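The code for this version is also missing from the post. One plausible approach, sketched here under that caveat, is to try a list of candidate locators in order and take the first one that matches. The function name `first_matching_text` and its parameters are hypothetical; the lookup function is passed in so the logic can be exercised without a browser.

```python
def first_matching_text(find_element, locators, not_found=Exception):
    """Try each (by, value) locator in order; return the text of the first hit, or None.

    find_element(by, value) is expected to raise `not_found` when nothing matches.
    """
    for by, value in locators:
        try:
            element = find_element(by, value)
        except not_found:
            continue
        return element.text
    return None
```

With Selenium, this could be called as `first_matching_text(web.find_element, [("id", "content"), ("xpath", '//*[@id="center"]')], NoSuchElementException)`, since `By.ID` and `By.XPATH` are the plain strings `"id"` and `"xpath"`. A `None` result then plays the role of the "不提供下载" fallback used in the earlier script.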

