BeautifulSoup Basics (Python)


Posted by chenchih on 2021-06-09

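These are my notes on basic web scraping. The field is usually split into two activities: web crawling (following links to discover pages) and web scraping (extracting data from a page). Today I am only sharing the basics.

First check that the BeautifulSoup package is installed. If it is not, install it with pip install beautifulsoup4 (the examples below also need requests: pip install requests).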

Basic page scraping

1. Import the modules

import os
import requests
from bs4 import BeautifulSoup

2. Fetch the page with requests

url = 'https://www.instagram.com/chenchih.test'
response = requests.get(url)

# response.text holds the entire page source as one hard-to-read string; parsing it with BeautifulSoup gives a structured HTML tree
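A slightly more defensive version of the request step might look like the sketch below. The function name fetch_html, the 10-second timeout, and the optional session argument are assumptions added for illustration and are not part of the original notes.

```python
import requests


def fetch_html(url, timeout=10, session=None):
    """Return the body of `url` as text, raising on HTTP errors.

    `session` is a hypothetical hook: any object with a requests-style
    .get() method works, which also makes the function easy to test.
    """
    http = requests if session is None else session
    response = http.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text
```

With this helper, the step above becomes html = fetch_html(url), and html can be passed to BeautifulSoup in place of response.text.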

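3. Parse the response into HTML with BeautifulSoup

Method 1: the built-in 'html.parser'. If you leave the parser argument out, BeautifulSoup picks the best parser installed on your system (lxml if available, otherwise html5lib, otherwise html.parser) and prints a warning. So BeautifulSoup(response.text) is only the same as BeautifulSoup(response.text, 'html.parser') when no other parser is installed. Naming the parser keeps the result consistent across machines.

soup = BeautifulSoup(response.text, 'html.parser')
soup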

Method 2: the lxml parser (requires pip install lxml)

soup=BeautifulSoup(response.text, 'lxml')
soup
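To see what the parsed soup object offers without hitting the network, here is a minimal sketch that parses an inline HTML string. The sample markup and the variable names title and intro are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for response.text
html = (
    "<html><head><title>Demo</title></head>"
    "<body><p class='intro'>Hello</p></body></html>"
)

# Naming the parser explicitly avoids the "no parser specified" warning
soup = BeautifulSoup(html, "html.parser")

title = soup.title.string                          # text inside <title>
intro = soup.find("p", class_="intro").get_text()  # text of the first matching <p>

print(title, intro)
print(soup.prettify())  # indented view of the parse tree
```

prettify() is handy when the raw response.text is too dense to read.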

Finding JavaScript <script> tags

scripts = soup.find_all('script', {'type': 'text/javascript'})
print(scripts)
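Note that filtering on type='text/javascript' skips <script> tags that omit the type attribute, which is common on modern pages. The sketch below, using made-up sample markup, compares the filtered and unfiltered searches and separates inline code from external script files.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with three kinds of <script> tag
html = """
<html><body>
<script type="text/javascript">var a = 1;</script>
<script>var b = 2;</script>
<script src="/static/app.js"></script>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

typed = soup.find_all("script", {"type": "text/javascript"})
all_scripts = soup.find_all("script")

# Inline scripts carry their code as text; external ones only have a src
inline_code = [s.string.strip() for s in all_scripts if s.string]
external_srcs = [s["src"] for s in all_scripts if s.has_attr("src")]

print(len(typed), len(all_scripts))
print(inline_code)
print(external_srcs)
```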

Finding images or videos on Instagram with Selenium

1. Set up Selenium and ChromeDriver

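import os, requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
DRIVER_PATH = 'D:\\test\\selenium\\chromedriver.exe'
# Selenium 4 takes the driver path through a Service object; the old executable_path argument was removed
driver = webdriver.Chrome(service=Service(DRIVER_PATH))
url = "https://www.instagram.com/p/CPhwo1RJHG7/"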
driver.get(url)

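2. Find the img, video, or blob elements

Method 1: search by class

soup = BeautifulSoup(driver.page_source, 'html.parser')
# href of the first link on the page
soup.find_all("a", href=True)[0]['href']
# src of the first image inside the post container; these class names are generated by Instagram and may change
soup.find_all('div', {"class": "eLAPa kPFhm"})[0].find_all('img')[0]['src']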
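One detail worth knowing about the class search above: passing a string with two class names matches the class attribute as an exact string, so the order of the names matters. A CSS selector through select() matches the classes in any order. The sketch below shows the difference on made-up markup; the example.com URLs are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same two classes, written in different orders
html = """
<div class="eLAPa kPFhm"><img src="https://example.com/a.jpg"></div>
<div class="kPFhm eLAPa"><img src="https://example.com/b.jpg"></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: only the first div qualifies
exact = soup.find_all("div", {"class": "eLAPa kPFhm"})

# CSS selector: any div carrying both classes, regardless of order
any_order = soup.select("div.eLAPa.kPFhm")

exact_srcs = [d.find("img")["src"] for d in exact]
select_srcs = [d.find("img")["src"] for d in any_order]

print(exact_srcs)
print(select_srcs)
```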

Method 2: search for img or video tags

soup = BeautifulSoup(driver.page_source, 'html.parser')
videos = soup.find_all('video')
videos

videos lists every <video> element on the page.

videos[0]['src'] returns the blob URL of the first video:

'blob:https://www.instagram.com/47eae1ca-37b2-4cd5-94a3-e9a7ae20e4e9'
  1. Images can be collected the same way:
    images = soup.find_all('img')
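A blob: URL like the one above refers to an object held inside the browser, so it cannot be downloaded with requests.get the way an ordinary https link can. A small check along these lines can separate the two cases before attempting a download; the helper name is_direct_download is an assumption made for this sketch.

```python
from urllib.parse import urlparse


def is_direct_download(src):
    """Return True if `src` is a plain http(s) URL that an HTTP client could fetch."""
    if not src:
        return False  # missing or empty src attribute
    return urlparse(src).scheme in ("http", "https")


blob_src = "blob:https://www.instagram.com/47eae1ca-37b2-4cd5-94a3-e9a7ae20e4e9"
print(is_direct_download(blob_src))
print(is_direct_download("https://example.com/clip.mp4"))
```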

Scraping photos from a web page

1. Set up Selenium and ChromeDriver

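import time, requests, os
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
# webdriver_manager downloads a matching chromedriver and returns its path
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))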

2. Load the page

url ='url page'  # placeholder: replace with the address of the page to scrape
driver.get(url)

3. Scrape the images

soup = BeautifulSoup(driver.page_source, 'html.parser')
images = []
for link in soup.find_all("img"):
    src = link.get("src")
    # Some <img> tags have no src attribute, so guard against None
    if src and src.endswith(".jpg"):
        images.append(src)
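The loop above misses relative paths and addresses that carry a query string after the extension (for example photo.jpg?size=large). The sketch below is one possible way to normalise the collected src values using only the standard library. The names collect_image_urls, filename_for, and IMAGE_EXTENSIONS, as well as the example.com addresses, are assumptions for illustration.

```python
import os
from urllib.parse import urljoin, urlparse

# Hypothetical list of extensions to keep
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def collect_image_urls(srcs, page_url, extensions=IMAGE_EXTENSIONS):
    """Turn raw src values into absolute URLs, keeping only image files."""
    urls = []
    for src in srcs:
        if not src:
            continue  # <img> without a src attribute
        absolute = urljoin(page_url, src)  # resolve relative paths
        # Compare against the path only, so query strings do not hide the extension
        path = urlparse(absolute).path.lower()
        if path.endswith(extensions):
            urls.append(absolute)
    return urls


def filename_for(url):
    """Derive a local file name from the path part of the URL."""
    return os.path.basename(urlparse(url).path)


srcs = ["/img/a.jpg", None, "https://cdn.example.com/b.JPG?size=large", "icon.svg"]
found = collect_image_urls(srcs, "https://example.com/gallery/")
print(found)
print([filename_for(u) for u in found])
```

In the scraper itself, srcs would be [link.get("src") for link in soup.find_all("img")] and page_url would be the url passed to driver.get.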

#Python #crawl #bs4






