[Python] Fix for the 博客來 (books.com.tw) Online Bookstore New-Release Ranking Program

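The 博客來 online bookstore website has recently tightened its traffic controls. As a result, the book's original approach of downloading the entire new-release ranking in a single run now stops partway through with an error. To avoid this, we narrow the scope of the crawl and download only the specified categories on each run.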

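The revised program downloads only the categories you specify, and you set them by hand. The category number in the program is kindno; for example, kindno=1 is the 文學小說 (Literature & Fiction) category.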

Fix for <books.py>: the most important change is that the original line 67, kind=href.text, is replaced by line 66, which looks up the category name by index.

 66      kind=hrefs[kindno-1].text  # category name
 67  #    kind=href.text  # category name
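To see why the index lookup matters, here is a minimal sketch that uses a hard-coded list in place of the scraped links. The list `names` (English renderings of the first four categories) and the helper `pick_names` are hypothetical and exist only for illustration. When the counter starts above 1, the loop variable still begins at the first link, whereas `names[kindno - 1]` follows the counter.

```python
# Hypothetical stand-in for the text of the scraped category links.
names = ["Literature & Fiction", "Business & Finance",
         "Art & Design", "Humanities, History & Geography"]

def pick_names(names, start, stop):
    """Return (by_loop_variable, by_index) pairs for categories start..stop-1."""
    pairs = []
    kindno = start
    for name in names:                      # the loop always begins at the first link
        pairs.append((name, names[kindno - 1]))
        kindno += 1
        if kindno == stop:                  # same role as the break on line 70
            break
    return pairs

for by_loop, by_index in pick_names(names, 3, 5):
    print("loop variable:", by_loop, "| index lookup:", by_index)
```

With a start of 3, the loop variable would label the downloaded books with the names of categories 1 and 2, while the index lookup gives categories 3 and 4.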

Line 70 is also added to control at which category the download stops.

 54  kindno=1  # category counter; set this to the first category to download
  …
 70      if kindno==3: break  # stop before this category (development stage)

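Adjusting just lines 54 and 70 selects which categories are shown. For example, the following settings show categories 1 and 2, that is, 文學小說 (Literature & Fiction) and 商業理財 (Business & Finance).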

 54  kindno=1  # category counter; set this to the first category to download
  …
 70      if kindno==3: break  # stop before this category (development stage)

Likewise, the following settings show categories 3 and 4, that is, 藝術設計 (Art & Design) and 人文史地 (Humanities, History & Geography).

 54  kindno=3  # category counter; set this to the first category to download
  …
 70      if kindno==5: break  # stop before this category (development stage)
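The two settings can be read as a half-open range: the start value goes on line 54 and the stop value on line 70. Below is a minimal sketch of the URLs such a range would visit, assuming the same URL pattern as the listing. The function name `category_urls` is hypothetical, no request is sent, and the `02d` format is used here as a stand-in for the listing's twobyte() helper.

```python
BASE = "http://www.books.com.tw/web/books_nbtopm_"
MODE = "?o=5&v=1"

def category_urls(start, stop):
    """Build the URLs for categories start..stop-1 (stop is the break value)."""
    # f"{kindno:02d}" zero-pads to two digits, like twobyte() in the listing.
    return [f"{BASE}{kindno:02d}{MODE}" for kindno in range(start, stop)]

for category_url in category_urls(3, 5):
    print(category_url)
```

This matches the listing only when the start value is smaller than the stop value. If it is not, the break on line 70 never fires and the original loop keeps running.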

<books.py> corrected code

 1   def showbook(url,kind):
 2       html = requests.get(url).text
 3       soup = BeautifulSoup(html,'html.parser')
 4       try:
 5           pages=int(soup.select('.cnt_page span')[0].text)  # number of pages in this category
 6           print("共有",pages,"頁")
 7           for page in range(1,pages+1):
 8               pageurl=url + '&page=' + str(page).strip()
 9               print("第",page,"頁",pageurl)
 10              showpage(pageurl,kind)
 11      except:  # no pagination: treat the category as a single page
 12          showpage(url,kind)
 13  
 14  def showpage(url,kind):
 15      html = requests.get(url).text
 16      soup = BeautifulSoup(html,'html.parser')
 17      # recent new books are inside class="mod type02_m012 clearfix"
 18      res = soup.find_all('div',{'class':'mod type02_m012 clearfix'})[0]
 19      items=res.select('.item')  # all items
 20      n=0  # counts the books on this page
 21      for item in items:
 22          msg=item.select('.msg')[0]
 23          src=item.select('a img')[0]["src"]
 24          title=msg.select('a')[0].text  # title
 25          imgurl=src.split("?i=")[-1].split("&")[0]  # image URL
 26          author=msg.select('a')[1].text  # author
 27          publish=msg.select('a')[2].text  # publisher
 28          date=msg.find('span').text.split("：")[-1]  # publication date
 29          onsale=item.select('.price .set2')[0].text  # discounted price
 30          content=item.select('.txt_cont')[0].text.replace(" ","").strip()  # description
 31          print("\n分類：" + kind)
 32          print("書名：" + title)
 33          print("圖片網址：" + imgurl)
 34          print("作者：" + author)
 35          print("出版社：" + publish)
 36          print("出版日期：" + date)
 37          print(onsale)  # discounted price
 38          print("內容：" + content)
 39          n+=1
 40          print("n=",n)
 41  #
 42  #        if n==2: break  # development stage
 43  
 44  def twobyte(kindno):
 45      if kindno<10:
 46          kindnostr="0"+str(kindno)
 47      else:
 48          kindnostr=str(kindno) 
 49      return kindnostr
 50  
 51  # main program
 52  import requests
 53  from bs4 import BeautifulSoup
 54  kindno=1  # category counter; set this to the first category to download
 55  homeurl = 'http://www.books.com.tw/web/books_nbtopm_01/?o=5&v=1'
 56  mode="?o=5&v=1"  # display mode: vertical list; sorted by: best-selling
 57  url="http://www.books.com.tw/web/books_nbtopm_"
 58  html = requests.get(homeurl).text
 59  soup = BeautifulSoup(html,'html.parser')
 60  # new Chinese-language book categories: collect the category links
 61  res = soup.find('div',{'class':'mod_b type02_l001-1 clearfix'})
 62  hrefs=res.select("a")
 63  for href in hrefs:
 64      kindurl=url + twobyte(kindno) + mode  # category URL
 65      print("\nkindno=",kindno)
 66      kind=hrefs[kindno-1].text  # category name, looked up by index
 67  #    kind=href.text  # category name
 68      showbook(kindurl,kind)  # show all books in this category
 69      kindno+=1
 70      if kindno==3: break  # stop before this category (development stage)
 71

Apply the same fix to <books_xlsx.py>. Lines 63 and 67 are the changed lines. As before, set lines 51 and 67 by hand to choose which categories to download.

 51  kindno=1  # category counter; set this to the first category to download
  …
 67      if kindno==3: break  # stop before this category (development stage)

<books_xlsx.py> corrected code

 1   def showbook(url,kind):
 2       html = requests.get(url).text
 3       soup = BeautifulSoup(html,'html.parser')
 4       try:
 5           pages=int(soup.select('.cnt_page span')[0].text)  # number of pages in this category
 6           print("共有",pages,"頁")
 7           for page in range(1,pages+1):
 8               pageurl=url + '&page=' + str(page).strip()
 9               print("第",page,"頁",pageurl)
 10              showpage(pageurl,kind)
 11      except:  # no pagination: treat the category as a single page
 12          showpage(url,kind)
 13  
 14  def showpage(url,kind):
 15      html = requests.get(url).text
 16      soup = BeautifulSoup(html,'html.parser')
 17      # recent new books are inside class="mod type02_m012 clearfix"
 18      res = soup.find_all('div',{'class':'mod type02_m012 clearfix'})[0]
 19      items=res.select('.item')  # all items
 20      n=0  # counts the books on this page
 21      for item in items:
 22          msg=item.select('.msg')[0]
 23          src=item.select('a img')[0]["src"]
 24          title=msg.select('a')[0].text  # title
 25          imgurl=src.split("?i=")[-1].split("&")[0]  # image URL
 26          author=msg.select('a')[1].text  # author
 27          publish=msg.select('a')[2].text  # publisher
 28          date=msg.find('span').text.split("：")[-1]  # publication date
 29          onsale=item.select('.price .set2')[0].text  # discounted price
 30          content=item.select('.txt_cont')[0].text.replace(" ","").strip()  # description
 31          # append the record to the list1 list
 32          listdata=[kind,title,imgurl,author,publish,date,onsale,content]
 33          list1.append(listdata)
 34          n+=1
 35          print("n=",n)
 36  
 37  def twobyte(kindno):
 38      if kindno<10:
 39          kindnostr="0"+str(kindno)
 40      else:
 41          kindnostr=str(kindno) 
 42      return kindnostr
 43  
 44  # main program
 45  import requests
 46  from bs4 import BeautifulSoup
 47  import openpyxl
 48  workbook=openpyxl.Workbook()   # create a workbook
 49  sheet = workbook.worksheets[0] # get the first worksheet
 50  list1=[]
 51  kindno=1  # category counter; set this to the first category to download
 52  homeurl = 'http://www.books.com.tw/web/books_nbtopm_01/?o=5&v=1'
 53  mode="?o=5&v=1"  # display mode: vertical list; sorted by: best-selling
 54  url="http://www.books.com.tw/web/books_nbtopm_"
 55  html = requests.get(homeurl).text
 56  soup = BeautifulSoup(html,'html.parser')
 57  # new Chinese-language book categories: collect the category links
 58  res = soup.find('div',{'class':'mod_b type02_l001-1 clearfix'})
 59  hrefs=res.select("a")
 60  for href in hrefs:
 61      kindurl=url + twobyte(kindno) + mode  # category URL
 62      print("\nkindno=",kindno)
 63      kind=hrefs[kindno-1].text  # category name, looked up by index
 64  #    kind=href.text  # category name
 65      showbook(kindurl,kind)  # collect all books in this category
 66      kindno+=1
 67      if kindno==3: break  # stop before this category (development stage)
 68
 69  # Excel data
 70  listtitle=["分類","書名","圖片網址","作者","出版社","出版日期","優惠價","內容"]
 71  sheet.append(listtitle)  # header row
 72  for item1 in list1:  # data rows
 73      sheet.append(item1)
 74
 75  workbook.save('books_all.xlsx')

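When a run finishes, <books_all.xlsx> holds the books of the specified categories. Merging the files from the individual runs by hand then gives the complete new-release ranking data.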
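The manual merge could also be scripted. The following is only a sketch under stated assumptions: every file has the same header row in its first worksheet, and each run's output was renamed before the next run, because the program always saves to the same file name. The helpers `merge_rows` and `merge_workbooks` and the file names in the usage note are hypothetical.

```python
def merge_rows(tables):
    """Concatenate tables (lists of rows), keeping only the first header row."""
    merged = []
    for rows in tables:
        if not rows:
            continue                        # skip empty sheets
        # The first non-empty table keeps its header; later ones drop it.
        merged.extend(rows if not merged else rows[1:])
    return merged

def merge_workbooks(paths, out_path):
    """Read the first worksheet of each workbook and save the merged rows."""
    import openpyxl  # imported here so merge_rows can be used without openpyxl
    tables = []
    for path in paths:
        sheet = openpyxl.load_workbook(path).worksheets[0]
        tables.append([list(row) for row in sheet.iter_rows(values_only=True)])
    workbook = openpyxl.Workbook()
    out = workbook.worksheets[0]
    for row in merge_rows(tables):
        out.append(row)
    workbook.save(out_path)
```

A call might look like merge_workbooks(["books_01_02.xlsx", "books_03_04.xlsx"], "books_merged.xlsx"), where the input names are whatever you renamed each run's <books_all.xlsx> to.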
