2022-09-05
Novel download script (writing a novel script)
================================
Tool preparation:
================================
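Download the chromedriver build that matches your Chrome browser version; a chromedriver download mirror in China is available.
Copy chromedriver.exe into Python's Scripts directory, for example C:\Anaconda3\Scripts\,
and add C:\Anaconda3\Scripts\ to the Windows PATH environment variable.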
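Before launching a browser it can help to confirm that the driver really is reachable through PATH. The following is a minimal sketch using only the standard library; the helper name find_driver is hypothetical and not part of the original scripts.

```python
import shutil
from typing import Optional


def find_driver(name: str = "chromedriver") -> Optional[str]:
    """Return the full path of the driver executable found on PATH, or None."""
    # On Windows, shutil.which also tries the extensions listed in PATHEXT,
    # so "chromedriver" matches chromedriver.exe.
    return shutil.which(name)


path = find_driver()
print(path if path else "chromedriver was not found on PATH")
```

If the message reports that nothing was found, the Scripts directory is probably missing from PATH, or the terminal was opened before PATH was changed.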
================================
Install the selenium Python package
================================
pip install selenium
================================
More about selenium
================================
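selenium supports not only Python but also Java and C#.
The scripts below were refined step by step, so the last download script is the most general and the most polished.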
================================
Derive each chapter's URL from its chapter number, then download
================================
from selenium import webdriver

web = webdriver.Chrome()
full_text="小说:穿越种田之将门妻"   # output header: "Novel: <title>"
full_text=full_text+"\n" +"\n" +"\n"
home_url=" #39
start_page_id=0
for i in range(chapter_start,chapter_end+1):
    # The page id in the URL is the chapter number plus a fixed offset
    page_id=i+start_page_id
    url=home_url+str(page_id)+".html"
    #print("第"+str(i)+"章")
    # Separator line plus the chapter heading "第<i>章" (Chapter <i>)
    full_text=full_text+"\n" +"\n" +"\n" +"======================"+"\n"+"第"+str(i)+"章"+ "\n"
    web.get(url) #
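The URL arithmetic used by this first script can be isolated and checked without a browser. This is a small sketch under assumptions: the function names chapter_url and chapter_urls and the address https://example.com/book/ are hypothetical placeholders, not values from the original script.

```python
from typing import List


def chapter_url(home_url: str, chapter: int, start_page_id: int = 0) -> str:
    """Build one chapter URL as home_url + (chapter + offset) + ".html"."""
    page_id = chapter + start_page_id
    return home_url + str(page_id) + ".html"


def chapter_urls(home_url: str, chapter_start: int, chapter_end: int,
                 start_page_id: int = 0) -> List[str]:
    """Build the URLs for chapters chapter_start..chapter_end, inclusive."""
    return [chapter_url(home_url, i, start_page_id)
            for i in range(chapter_start, chapter_end + 1)]


# Hypothetical site where chapter 1 lives at page id 101
for url in chapter_urls("https://example.com/book/", 1, 3, start_page_id=100):
    print(url)
```

This approach only works when page ids increase by exactly one per chapter; the list-page approach in the next section avoids that assumption.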
================================
Extract each chapter's URL from the list page, then download the chapter text
================================
#========================================
# Method 1: digit-by-digit conversion to Chinese. Flawed: 10 becomes 一零, for example
#========================================
def num_to_char(num):
    """Convert a number to Chinese, one digit at a time."""
    num_dict={"0":u"零","1":u"一","2":u"二","3":u"三","4":u"四","5":u"五","6":u"六","7":u"七","8":u"八","9":u"九"}
    # Each digit is mapped on its own; place values (十, 百, 千) are ignored
    return "".join(num_dict[d] for d in str(num))

#========================================
# Method 2: number to Chinese, more complete
#========================================
# -------------------------------------------------------------------------------
# Name: num2chinese
# Author: yunhgu
# Date: 2021/8/24 14:51
# Description:
# -------------------------------------------------------------------------------
_MAPPING = (u'零', u'一', u'二', u'三', u'四', u'五', u'六', u'七', u'八', u'九',)
_P0 = (u'', u'十', u'百', u'千',)
_S4, _S8, _S16 = 10 ** 4, 10 ** 8, 10 ** 16
_MIN, _MAX = 0, 9999999999999999


class NotIntegerError(Exception):
    pass


class OutOfRangeError(Exception):
    pass


class Num2Chinese:
    def convert(self, number: int):
        """
        :param number:
        :return:chinese number
        """
        return self._to_chinese(number)

    def _to_chinese(self, num):
        if not str(num).isdigit():
            raise NotIntegerError(u'%s is not an integer.' % num)
        if num < _MIN or num > _MAX:
            raise OutOfRangeError(u'%d out of range[%d, %d]' % (num, _MIN, _MAX))
        if num < _S4:
            return self._to_chinese4(num)
        elif num < _S8:
            return self._to_chinese8(num)
        else:
            return self._to_chinese16(num)

    @staticmethod
    def _to_chinese4(num):
        assert (0 <= num < _S4)
        if num < 10:
            return _MAPPING[num]
        else:
            # Collect the digits, least significant first
            lst = []
            while num >= 10:
                lst.append(num % 10)
                num = num // 10
            lst.append(num)
            c = len(lst)  # number of digits
            result = u''
            # The string is built reversed (unit before digit) and flipped at the end
            for idx, val in enumerate(lst):
                if val != 0:
                    result += _P0[idx] + _MAPPING[val]
                    if idx < c - 1 and lst[idx + 1] == 0:
                        result += u'零'
            return result[::-1].replace(u'一十', u'十')

    def _to_chinese8(self, num):
        assert (num < _S8)
        to4 = self._to_chinese4
        if num < _S4:
            return to4(num)
        else:
            mod = _S4
            high, low = num // mod, num % mod
            if low == 0:
                return to4(high) + u'万'
            else:
                if low < _S4 // 10:
                    return to4(high) + u'万零' + to4(low)
                else:
                    return to4(high) + u'万' + to4(low)

    def _to_chinese16(self, num):
        assert (num < _S16)
        to8 = self._to_chinese8
        mod = _S8
        high, low = num // mod, num % mod
        if low == 0:
            return to8(high) + u'亿'
        else:
            if low < _S8 // 10:
                return to8(high) + u'亿零' + to8(low)
            else:
                return to8(high) + u'亿' + to8(low)

#========================================
# Extract each chapter's URL from the list page, then download the chapter text
#========================================
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

web = webdriver.Chrome()
num2chinese = Num2Chinese()
full_text="小说:掌家小娘子"   # output header: "Novel: <title>"
full_text=full_text+"\n" +"\n" +"\n"
print(full_text)
list_url=" #306
for i in range(chapter_start,chapter_end+1):
    chinese_chapter_id=num2chinese.convert(i)  # Chinese numerals
    #chinese_chapter_id=str(i)  # Arabic numerals
    chinese_chapter_name="第"+chinese_chapter_id+"章"
    # _to_chinese4 shortens every 一十 to 十, so 110 comes out as 一百十; restore 一百一十.
    # (str.find returns -1 when nothing matches, which is truthy, so test with "in".)
    if "百十" in chinese_chapter_name:
        chinese_chapter_name=chinese_chapter_name.replace("百十", "百一十")
    #print(chinese_chapter_name)
    web.get(list_url)  # go back to the list page to look up this chapter's URL
    url=""
    try:
        # Selenium 4 locator API; find_element_by_partial_link_text was removed in Selenium 4.3
        url=web.find_element(By.PARTIAL_LINK_TEXT, chinese_chapter_name).get_attribute("href")
    except NoSuchElementException:
        url=""
    #print(url)
    if url:
        web.get(url) #
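For chapter headings, which rarely go past a few thousand, a much smaller converter is enough, and it can emit 一百一十 directly so that no 百十 patch is needed afterwards. The sketch below is an alternative technique, not the Num2Chinese class above; the name chapter_numeral and the 0 to 9999 range are assumptions made for illustration.

```python
DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]


def chapter_numeral(n: int) -> str:
    """Convert 0..9999 to the Chinese numeral form used in chapter titles."""
    if not isinstance(n, int) or not 0 <= n <= 9999:
        raise ValueError("expected an integer in the range 0..9999")
    if n == 0:
        return DIGITS[0]
    digits = [int(ch) for ch in str(n)]
    parts = []
    pending_zero = False
    for pos, d in enumerate(digits):
        unit = UNITS[len(digits) - 1 - pos]
        if d == 0:
            # Remember the zero; it is written only if a nonzero digit follows
            pending_zero = True
            continue
        if pending_zero:
            parts.append(DIGITS[0])
            pending_zero = False
        parts.append(DIGITS[d] + unit)
    text = "".join(parts)
    # Only a leading 一十 is shortened (10..19), so 110 stays 一百一十
    if text.startswith("一十"):
        text = text[1:]
    return text


for n in (10, 101, 110, 1010):
    print(n, "第" + chapter_numeral(n) + "章")
```

Sites differ in how they write such numbers (some use 一百十, some use Arabic digits), so the link text on the actual list page should be checked before relying on either form.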
================================
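Each chapter may span multiple pages
Performance optimized
Output written to a file automatically
Added download of side stories (番外篇)
Code logic cleaned up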
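Two of the improvements listed above, multi-page chapters and writing to a file, can be sketched independently of the browser. Everything here is an assumption made for illustration: collect_chapter, save_novel, and the fetch_page callback (which returns the page text and the next page's URL, or None on the last page) are hypothetical names, not the original implementation.

```python
from typing import Callable, Iterable, Optional, Tuple

# fetch_page(url) -> (text of that page, URL of the next page or None)
FetchPage = Callable[[str], Tuple[str, Optional[str]]]


def collect_chapter(first_url: str, fetch_page: FetchPage,
                    max_pages: int = 50) -> str:
    """Follow a chapter's next-page links and join the text of every page."""
    texts = []
    seen = set()
    url: Optional[str] = first_url
    # Stop on the last page, on a repeated URL, or after max_pages pages
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        text, url = fetch_page(url)
        texts.append(text)
    return "\n".join(texts)


def save_novel(path: str, title: str,
               chapters: Iterable[Tuple[str, str]]) -> None:
    """Write the title and all (heading, body) pairs to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(title + "\n\n\n")
        for heading, body in chapters:
            f.write("======================\n")
            f.write(heading + "\n" + body + "\n\n\n")
```

With Selenium, fetch_page would load the URL, read the body text, and look for a next-page link; keeping that behind a callback lets the paging logic be tested with fake pages. Opening the file with encoding="utf-8" matters on Windows, where the default encoding may not be able to represent Chinese text.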
================================
================================
Handling body text whose xpath is not fixed
================================
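One way to cope with a body xpath that varies between pages is to try a list of candidate xpaths in order and keep the first one that yields non-empty text. This is a sketch under assumptions: first_text and the candidate xpaths are hypothetical, and the lookup itself is passed in as a callback so the logic does not depend on a running browser.

```python
from typing import Callable, Iterable, Optional


def first_text(find_text: Callable[[str], str],
               xpaths: Iterable[str]) -> Optional[str]:
    """Return the text for the first xpath that matches and is not blank."""
    for xpath in xpaths:
        try:
            text = find_text(xpath)
        except Exception:
            # No element for this xpath on the current page; try the next one
            continue
        if text and text.strip():
            return text
    return None


# Hypothetical candidates; the real ones depend on the target site's markup
CANDIDATE_XPATHS = [
    '//div[@id="content"]',
    '//div[@class="content"]',
]

# With Selenium 4 the callback could be written as:
#   from selenium.webdriver.common.by import By
#   find_text = lambda xp: web.find_element(By.XPATH, xp).text
```

Returning None instead of raising lets the caller decide whether to skip the chapter or stop the download.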