Web Scraping in Practice: the PTT NBA Board

February 06, 2020


PTT in practice:

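import requests
from bs4 import BeautifulSoup
import time

# %m is the month and %d is the day. strftime zero-pads the month (e.g. '02/06'),
# but the dates on the PTT board have no leading zero on the month, so strip it.
# lstrip('0') removes the given characters from the left end of the string.
today = time.strftime('%m/%d').lstrip('0')
print(today)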
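As a small illustration of the date handling above, here is a sketch that builds the same string from a date object. The helper name `ptt_date` is hypothetical, and it assumes the board shows the month without a leading zero and the day with two digits, which is what the `lstrip('0')` trick relies on.

```python
import datetime

def ptt_date(day):
    """Format a date as month/day with no leading zero on the month
    and a two-digit day (assumed to match the board's date column)."""
    return '%d/%02d' % (day.month, day.day)

sample = datetime.date(2020, 2, 6)
print(ptt_date(sample))
# Same result as the strftime + lstrip approach used above
print(sample.strftime('%m/%d').lstrip('0'))
```

Both approaches leave two-digit months such as 10 untouched, because only a leading zero is dropped.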

def pttNBA(url):
    resp = requests.get(url)
    if resp.status_code != 200:
        print('URL error: ' + url)
        return

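What 200 means:
How to check whether we fetched the page successfully

Printing resp.status_code shows the page's HTTP status code, which tells us whether the server received the request and responded normally.
Common status codes: 200 means OK and 404 means the page was not found; see the list of HTTP status codes for the rest.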
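To make the status codes a little more concrete, here is a minimal sketch that looks up the standard phrase for a code using Python's built-in `http.HTTPStatus`. The helper name `describe_status` is hypothetical; it is not part of requests or of the crawler in this post.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label such as '200 OK' for an HTTP status code."""
    try:
        return '%d %s' % (code, HTTPStatus(code).phrase)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know
        return '%d (unknown status)' % code

for code in (200, 404, 500):
    print(describe_status(code))
```

In the crawler, the value passed in would be resp.status_code.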

    soup = BeautifulSoup(resp.text, 'html5lib')  # hand the page content to BeautifulSoup for parsing

    # The first step is to get the link to the previous page. It is the second
    # <a> in the paging button group, hence index [1]; adding ['href'] then
    # reads the link target as text.
    paging = soup.find('div', 'btn-group btn-group-paging').find_all('a')[1]['href']

    articles = []
    rents = soup.find_all('div', 'r-ent')
    for rent in rents:
        title = rent.find('div', 'title').text.strip()  # strip() removes surrounding whitespace
        count = rent.find('div', 'nrec').text.strip()  # push count
        # the date is wrapped in one more tag, so find is needed twice
        date = rent.find('div', 'meta').find('div', 'date').text.strip()
        article = '%s %s:%s' % (date, count, title)
        try:
            if today == date and int(count) > 10:
                articles.append(article)
        except ValueError:
            # count can be '爆' or start with 'X' (or be empty), and int() raises
            # ValueError on those, so '爆' is handled here instead of crashing.
            if today == date and count == '爆':
                articles.append(article)

    if len(articles) != 0:
        for article in articles:
            print(article)
        pttNBA('https://www.ptt.cc' + paging)
    else:
        return

pttNBA('https://www.ptt.cc/bbs/NBA/index6508.html')  # finally, call it to fetch the data

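The code above prints the date, push count, and title of the articles posted today that have more than 10 pushes (or are marked 爆).

Below is my modified version, which fetches the article data from the previous 5 pages: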
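The push-count check can also be written without try/except. The following is only a sketch: the function name `is_hot` and the `threshold` parameter are hypothetical, and it assumes the stripped nrec text is either empty, a number, '爆', or a value starting with 'X'.

```python
def is_hot(count, threshold=10):
    """Return True when the push-count text passes the filter.

    Assumes count is the stripped text of the nrec cell: '' (no pushes),
    digits such as '15', '爆', or a value starting with 'X'.
    """
    if count == '爆':
        return True
    # isdecimal() is False for '' and for values like 'X1',
    # so int() is only called on strings it can convert
    return count.isdecimal() and int(count) > threshold

for sample in ('15', '10', '爆', 'X1', ''):
    print(repr(sample), is_hot(sample))
```

With a helper like this, the loop body could test `today == date and is_hot(count)` in one place.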

import requests
from bs4 import BeautifulSoup

def pttNBA(url):
    resp = requests.get(url)
    if resp.status_code != 200:
        print('URL error: ' + url)
        return

    soup = BeautifulSoup(resp.text, 'html5lib')  # hand the page content to BeautifulSoup for parsing

    # The first step is to get the link to the previous page. It is the second
    # <a> in the paging button group, hence index [1]; adding ['href'] then
    # reads the link target as text.
    paging = soup.find('div', 'btn-group btn-group-paging').find_all('a')[1]['href']

    articles = []
    rents = soup.find_all('div', 'r-ent')
    for rent in rents:
        title = rent.find('div', 'title').text.strip()  # strip() removes surrounding whitespace
        count = rent.find('div', 'nrec').text.strip()  # push count
        # the date is wrapped in one more tag, so find is needed twice
        date = rent.find('div', 'meta').find('div', 'date').text.strip()
        article = '%s %s:%s' % (date, count, title)
        # stop collecting once the previous-page link reaches index6502
        if paging != '/bbs/NBA/index6502.html':
            articles.append(article)

    if len(articles) != 0:
        for article in articles:
            print(article)
        # paging already starts with '/', so the base URL has no trailing slash
        pttNBA('https://www.ptt.cc' + paging)
    else:
        return

pttNBA('https://www.ptt.cc/bbs/NBA/index6507.html')

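For this modified version, the task asked for the previous five pages. I noticed that the index in the paging link changes with the page number, so I used that to set a stopping point: once the previous-page link points to index6502, the program stops following the previous-page link. Note that the check is made against the previous-page link, so the page whose link points to index6502 (that is, index6503) is itself not printed: starting from index6507, the output covers index6507 through index6504. Using '/bbs/NBA/index6501.html' as the stop value instead would include index6503 as the fifth page.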
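An alternative to hard-coding the stop index is to count pages. The sketch below shows only the page-walking loop, under stated assumptions: `walk_back` and `fake_prev_href` are hypothetical names, and `fake_prev_href` merely decrements the index in the URL so that the loop can be tried without network access. In the real crawler, that role would be played by the paging href read from each page.

```python
import re
from urllib.parse import urljoin

BASE = 'https://www.ptt.cc'

def walk_back(start_url, max_pages, get_prev_href):
    """Return up to max_pages URLs, starting at start_url and following
    the previous-page link returned by get_prev_href(url)."""
    urls = []
    url = start_url
    while url and len(urls) < max_pages:
        urls.append(url)
        href = get_prev_href(url)  # e.g. '/bbs/NBA/index6506.html', or None
        # urljoin avoids a doubled slash between the host and the path
        url = urljoin(BASE, href) if href else None
    return urls

def fake_prev_href(url):
    """Hypothetical stand-in for the real paging lookup: it only
    decrements the index found in the URL."""
    match = re.search(r'index(\d+)\.html$', url)
    if not match or int(match.group(1)) <= 1:
        return None
    return '/bbs/NBA/index%d.html' % (int(match.group(1)) - 1)

pages = walk_back(BASE + '/bbs/NBA/index6507.html', 5, fake_prev_href)
for page in pages:
    print(page)
```

Counting pages this way keeps working when the board's latest index changes, whereas a fixed stop index has to be updated by hand.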


Justin's 學習日誌
