Hands-on - Building a Web Crawler in Python

Created 2019-06-23 | Updated 2021-01-16 | Hands-on


Basic workflow

Connect to a specific URL and fetch the data
Parse the data and extract the part you actually want

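Fetching the data

Make the program look as much like an ordinary user as possible, because many sites do not want their data scraped by programs
The request must include headers, in particular User-Agent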

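import urllib.request as req

# Build a Request object and attach the request headers
url = "URL"  # replace with the address of the page to fetch
request = req.Request(url, headers = {
    # In the browser: F12 → Network → usually the first request in the list
    # → Headers → Request Headers → user-agent
    "User-Agent": "the value copied from the browser"
})

# Send the request and decode the response body as UTF-8 text
with req.urlopen(request) as response:
    data = response.read().decode("utf-8")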
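
The fetch step above can be wrapped in a small reusable helper. This is only a minimal sketch: the names `build_request`, `fetch_text` and `DEFAULT_USER_AGENT` are made up for illustration, and the User-Agent string is a placeholder that should be replaced with the value copied from your own browser.

```python
import urllib.request as req

# Placeholder value (an assumption); copy the real string from the browser's developer tools.
DEFAULT_USER_AGENT = "Mozilla/5.0 (placeholder user agent)"

def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request object that carries a User-Agent header."""
    return req.Request(url, headers={"User-Agent": user_agent})

def fetch_text(url, encoding="utf-8", timeout=10):
    """Download the page at url and return its body decoded as text."""
    with req.urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode(encoding)

if __name__ == "__main__":
    # Building the request does not touch the network, so it is safe to inspect.
    request = build_request("https://example.com/")
    print(request.get_header("User-agent"))
```

Note that urllib stores header names in capitalized form, which is why the header is read back as `User-agent`. Calling `fetch_text(url)` would then return the same `data` string as the code above.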

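Parsing the data

JSON format
- Parse it with the built-in json module
HTML format
- Parse it with the third-party package BeautifulSoup
- Install BeautifulSoup
  - Use the pip package manager (normally installed together with Python)
    $ pip install beautifulsoup4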
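
For the JSON case, a minimal sketch using the built-in json module could look like this. The sample payload and its field names (`items`, `title`) are invented for illustration; a real API response will have its own structure, and `data` would be the string returned by the fetch step.

```python
import json

# Hypothetical response body standing in for the downloaded text.
data = '{"items": [{"title": "First post"}, {"title": "Second post"}]}'

# json.loads turns the JSON text into ordinary Python dicts and lists.
parsed = json.loads(data)

# Pull out only the part we actually want: the title of every item.
titles = [item["title"] for item in parsed["items"]]
for title in titles:
    print(title)
```

Once the text is parsed, extracting the wanted fields is plain dictionary and list access, so no third-party package is needed for JSON.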

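import bs4

# Parse the HTML text downloaded in the fetch step
root = bs4.BeautifulSoup(data, "html.parser")
print(root.title.string)  # the text inside the <title> tag

# find() returns the first <div> whose class is "title", or None if there is none
title = root.find("div", class_="title")
if title is not None and title.a is not None:
    print(title.a.string)

# find_all() returns every <div> whose class is "title"
titles = root.find_all("div", class_="title")
for title in titles:
    if title.a is not None:  # skip entries that contain no <a> link
        print(title.a.string)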

tags: `Hands-on` `Python` `Web crawler` `BeautifulSoup`

Author: Kenny Li

Link: https://kennyliblog.nctu.me/2019/06/23/Python-web-crawler/

Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.
