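These are my notes on basic web scraping. There are really two related activities: web crawling (following links to discover pages) and web scraping (extracting data from a page). Today I'm only sharing the basics.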

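First make sure the BeautifulSoup package is installed. If it isn't, install it with pip install beautifulsoup4 (the package named BeautifulSoup on PyPI is the old version 3 release, which does not provide the bs4 module imported below). The examples also use requests, and method 2 needs lxml: pip install requests lxml.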

Scraping basic page content

1. Import the modules

import os
import requests
from bs4 import BeautifulSoup

2. Fetch the page with requests

url = 'https://www.instagram.com/chenchih.test'
response = requests.get(url)

# response.text holds the whole page as one raw string, which is hard to read, so we parse it as HTML instead
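A slightly more defensive version of the fetch above can be sketched as follows. The helper name fetch_html, the 10-second timeout, and the User-Agent string are my own assumptions for illustration, not part of the original notes; the timeout argument and raise_for_status() are standard requests features.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors."""
    # Hypothetical header value: some sites reject the default requests User-Agent
    headers = {"User-Agent": "Mozilla/5.0 (scraping-notes example)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Turn 4xx/5xx responses into an exception instead of parsing an error page
    response.raise_for_status()
    return response.text
```

Calling fetch_html(url) would then stand in for the requests.get(url) and response.text pair used in this section.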

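3. Parse the response into HTML with soup

Method 1: 'html.parser', Python's built-in parser. Note that it is not a fixed default: when no parser is named, BeautifulSoup picks the best parser installed (lxml, then html5lib, then 'html.parser') and emits a warning. So BeautifulSoup(response.text) only behaves the same as BeautifulSoup(response.text, 'html.parser') when neither lxml nor html5lib is installed; naming the parser keeps results consistent.

soup = BeautifulSoup(response.text, 'html.parser')
soup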

Method 2: 'lxml' (requires the lxml package)

soup = BeautifulSoup(response.text, 'lxml')
soup
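To see what the parsed soup gives you without hitting the network, here is a small sketch on an inline HTML string. The sample markup is made up for illustration, and 'html.parser' is used so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for response.text
html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='intro'>Hello</p><p>World</p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string                       # text inside <title>
first_p = soup.find("p").get_text()                  # text of the first <p>
all_p = [p.get_text() for p in soup.find_all("p")]   # text of every <p>

print(title_text)
print(first_p)
print(all_p)
print(soup.prettify())                               # indented view of the whole tree
```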

Searching for JavaScript tags

# find_all is the current name; findAll is a legacy alias
scripts = soup.find_all('script', {'type': 'text/javascript'})
print(scripts)
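The script search can be tried on a small inline page. Many pages omit the type attribute on script tags, in which case filtering on 'text/javascript' finds nothing and a plain find_all('script') is the safer query. The sample HTML below is invented to show the difference.

```python
from bs4 import BeautifulSoup

# Invented page with three script tags, only one of which declares text/javascript
html = """
<html><body>
<script type="text/javascript">var a = 1;</script>
<script>var b = 2;</script>
<script type="application/ld+json">{"name": "demo"}</script>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

typed = soup.find_all("script", {"type": "text/javascript"})  # only the first tag
every = soup.find_all("script")                               # all three tags

typed_sources = [s.string for s in typed]  # the JavaScript source inside each match
print(len(typed), len(every))
print(typed_sources)
```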

Searching for images or videos on Instagram with Selenium

1. Selenium and ChromeDriver setup

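import os, requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Path to a local chromedriver binary; adjust for your machine
DRIVER_PATH = 'D:\\test\\selenium\\chromedriver.exe'
# Selenium 4 takes the driver path through a Service object;
# the old executable_path argument was deprecated and later removed
driver = webdriver.Chrome(service=Service(DRIVER_PATH))
url = "https://www.instagram.com/p/CPhwo1RJHG7/"
driver.get(url)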

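2. Search for img, video, or blob elements

Method 1: search by class

soup = BeautifulSoup(driver.page_source, 'html.parser')
# href of the first <a> tag that has an href attribute
soup.find_all("a", href=True)[0]['href']
# src of the first <img> inside the first div whose class attribute is exactly "eLAPa kPFhm"
soup.find_all('div', {"class": "eLAPa kPFhm"})[0].find_all('img')[0]['src']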
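The same class lookup can also be written as a CSS selector, which matches the two classes regardless of their order or of extra classes on the div. The HTML below is a made-up fragment that reuses the class names from the snippet above; on the real site such generated class names may well change, so treat them as placeholders.

```python
from bs4 import BeautifulSoup

# Made-up fragment: same two classes, different order, plus one extra class
html = """
<div class="kPFhm eLAPa extra">
  <img src="https://example.com/photo.jpg" alt="demo">
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match finds nothing: the class attribute is not exactly "eLAPa kPFhm"
exact = soup.find_all("div", {"class": "eLAPa kPFhm"})

# CSS selector: a div carrying both classes, then the first <img> under it
img = soup.select_one("div.eLAPa.kPFhm img")
img_src = img["src"] if img is not None else None

print(len(exact))
print(img_src)
```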

Method 2: search by img or video tag

soup = BeautifulSoup(driver.page_source, 'html.parser')
videos = soup.find_all('video')
videos

videos lists every <video> tag found on the page.

videos[0]['src'] returns the blob URL of the first video:

'blob:https://www.instagram.com/47eae1ca-37b2-4cd5-94a3-e9a7ae20e4e9'
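A blob: URL like this points at data held inside the browser tab, so it cannot be fetched afterwards with requests. A minimal sketch of separating blob sources from ordinary http(s) ones is shown below; the HTML fragment and the example.com URL are invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented fragment: one blob source, one direct source, one tag without src
html = """
<video src="blob:https://www.instagram.com/47eae1ca-37b2-4cd5-94a3-e9a7ae20e4e9"></video>
<video src="https://example.com/clip.mp4"></video>
<video></video>
"""
soup = BeautifulSoup(html, "html.parser")

blob_urls, direct_urls = [], []
for video in soup.find_all("video"):
    src = video.get("src")          # None if the tag has no src attribute
    if not src:
        continue
    if src.startswith("blob:"):
        blob_urls.append(src)       # only valid inside the live browser session
    else:
        direct_urls.append(src)     # candidate for a normal HTTP download

print(blob_urls)
print(direct_urls)
```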

Images can be collected the same way:
images = soup.find_all('img')

Scraping photos from a web page

1. Selenium and ChromeDriver setup

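import os, time, requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# webdriver_manager downloads a matching chromedriver and returns its path
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))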

2. Load the page

url = 'url page'  # placeholder: replace with the address of the page to scrape
driver.get(url)

3. Collect the images

soup = BeautifulSoup(driver.page_source, 'html.parser')
images = []
for link in soup.find_all("img"):
    src = link.get("src")  # None when the <img> has no src attribute
    if src and src.endswith(".jpg"):
        images.append(src)
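The endswith(".jpg") check misses URLs that carry a query string (photo.jpg?size=l), upper-case extensions, and relative paths. A sketch that normalises these with the standard library follows; the function name collect_image_urls, the extension list, and the sample HTML are assumptions of mine.

```python
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")

def collect_image_urls(html, base_url):
    """Return absolute URLs of <img> tags whose path ends in an image extension."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for img in soup.find_all("img"):
        src = img.get("src")
        if not src:
            continue                                  # skip tags without a src
        absolute = urljoin(base_url, src)             # resolve relative paths
        path = urlparse(absolute).path.lower()        # ignore query string and case
        if path.endswith(IMAGE_EXTENSIONS):
            urls.append(absolute)
    return urls

# Invented sample covering a query string, an upper-case extension,
# a non-photo file, and a tag with no src
sample = """
<img src="/static/a.jpg?size=l">
<img src="https://cdn.example.com/b.PNG">
<img src="icon.svg">
<img alt="no src">
"""
found = collect_image_urls(sample, "https://example.com/gallery/")
print(found)
```

With Selenium, driver.page_source and driver.current_url could be passed as the two arguments.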

Beautifulsoup Basic (Python)

chenchih
