[Python] Python Web Crawling Basics 1: Beautiful Soup

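Just as Java has the jsoup library, Python has Beautiful Soup, a library widely used for web crawling.

Web crawling, put simply, means scraping the contents of web pages.

Beautiful Soup helps you pull the content you want out of HTML or XML documents.

As of January 2022, Beautiful Soup is at version 4.9.0, and the examples work the same way in Python 2.7 and Python 3.2.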

1. Installation

(1) On Linux (e.g. Ubuntu)

sudo apt-get install libxml2-dev libxslt-dev python-dev zlib1g-dev

sudo apt-get install python-lxml

(If you use a Python virtual environment, first run "workon [environment name]")

pip install lxml

pip install beautifulsoup4

(2) On Windows

(If you use a Python virtual environment, first run "workon [environment name]")

pip install lxml

pip install beautifulsoup4

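2. Features

Beautiful Soup works on top of a parser such as lxml or html5lib: it takes the contents of an HTML document and parses (analyzes) them, so from the user's point of view it plays the role of the parser.

It converts incoming page content to Unicode and produces UTF-8 on output.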
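The encoding behavior described above can be checked with a small sketch. The Latin-1 input below is made up for illustration, and from_encoding is only needed here because such a short document gives Beautiful Soup little to detect the encoding from.

```python
from bs4 import BeautifulSoup

# Assumed input: a tiny page encoded as Latin-1 rather than UTF-8.
raw_bytes = "<p>caf\u00e9</p>".encode("latin-1")

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="latin-1")

# Inside the parsed tree, text is ordinary Unicode (str).
text = str(soup.p.string)

# Encoding the document for output yields UTF-8 bytes by default.
utf8_bytes = soup.encode()

print(text)
print(utf8_bytes)
```

In real crawling you would normally let Beautiful Soup (or requests) work out the encoding and pass from_encoding only when the guess turns out to be wrong.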

3. Library documentation

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Use the documentation at the site above as your reference while developing.

It has detailed example code for fetching tag objects (find_all), child nodes (.children), sibling nodes (.next_sibling and .previous_sibling), and more.
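As a quick taste of those three features, here is a minimal sketch. The list markup is hypothetical and deliberately written without whitespace between tags, so that every child and sibling is a tag rather than a whitespace string.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with no whitespace between the tags.
html = '<ul><li id="a">one</li><li id="b">two</li><li id="c">three</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# find_all: every matching tag in the document
item_ids = [li["id"] for li in soup.find_all("li")]

# .children: the direct children of a tag
child_texts = [child.get_text() for child in soup.ul.children]

# .previous_sibling / .next_sibling: neighbors at the same level
middle = soup.find(id="b")
previous_id = middle.previous_sibling["id"]
next_id = middle.next_sibling["id"]

print(item_ids, child_texts, previous_id, next_id)
```

With normally indented HTML, the nearest sibling is often a string holding just the line break, so real code usually has to skip over those.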

4. Example

Write a test.py file and run it from cmd with the command python test.py.

test.py

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

# Print the HTML content
print(soup.prettify())

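Result

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>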

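If you want to fetch the contents of a specific web page, import the requests module and use it.

To import the requests module, install it beforehand with the command pip install requests (inside your Python virtual environment, if you use one).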

import requests
from bs4 import BeautifulSoup

# Download the page and take the response body as text
response = requests.get('https://google.com/')
html_doc = response.text

soup = BeautifulSoup(html_doc, 'html.parser')

# Print the HTML content
print(soup.prettify())
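In practice a request can hang or come back with an error page, so it may be worth wrapping the fetch in a small helper. The sketch below is only one way to do it: fetch_soup, its session parameter, and the 10-second default timeout are my own names and choices, not part of requests or Beautiful Soup.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10.0, session=None):
    """Hypothetical helper: download a page and return it parsed.

    session may be a requests.Session (or any object with a compatible
    get method); when omitted, the plain requests module is used.
    """
    http = session if session is not None else requests
    # timeout keeps the call from blocking forever on a slow server.
    response = http.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

It would be called as soup = fetch_soup('https://google.com/'), after which soup can be used exactly like the one in the example above.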

5. Frequently used commands

(1) Getting an object by tag name

soup.<tag name> returns the first object with that tag name.

# Get the title tag
soup.title
=> <title>The Dormouse's story</title>

# Use .name to get the object's tag name
soup.title.name
=> 'title'

# Use .string to get the object's content
soup.title.string
=> "The Dormouse's story"

# Get the p tag
soup.p
=> <p class="title"><b>The Dormouse's story</b></p>

# Use ['class'] to get the object's classes
soup.p['class']
=> ['title']

# Get the a tag
soup.a
=> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
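One thing the snippets above do not show is what happens when the tag or attribute is missing. A minimal sketch, using a shortened stand-in for the example document:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the example document.
soup = BeautifulSoup('<p class="title"><b>Story</b></p>', "html.parser")

first_p = soup.p
tag_name = first_p.name            # name of the tag
classes = first_p["class"]         # class is multi-valued, so this is a list
inner_text = str(first_p.b.string)

# A tag name that does not occur in the document gives None, not an error.
missing_tag = soup.table

# first_p["id"] would raise KeyError; .get returns None instead.
missing_attr = first_p.get("id")

print(tag_name, classes, inner_text, missing_tag, missing_attr)
```

Because a missing tag comes back as None, chaining such as soup.table.tr fails with AttributeError, so it is worth checking for None on pages you do not control.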

(2) find_all

find_all is the command you will use most often; it returns multiple tag objects in list form.

# Get the a tags
soup.find_all('a')
=> [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
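Since the result behaves like a list, the usual next step is to loop over it and pull out what you need. The sketch below collects the link text and URL of each a tag, using only the three links from the example document; the limit argument is an optional find_all parameter that caps the number of results.

```python
from bs4 import BeautifulSoup

html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect (link text, URL) pairs from every a tag.
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]

# limit stops the search after the given number of matches.
first_two = soup.find_all("a", limit=2)

# No match gives an empty result rather than None.
tables = soup.find_all("table")

for text, url in links:
    print(text, url)
```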

# Using find_all with a regular expression, part 1
# Fetch every tag whose name starts with b. The result contains the body tag and the b tag.
import re
soup.find_all(re.compile("^b"))
=> [<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>, <b>The Dormouse's story</b>]

# Using find_all with a regular expression, part 2
# Fetch every tag whose name starts with b and print each tag name. The output is body and b.
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
=> body
b
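A compiled pattern can also be used to filter on attribute values, not just tag names. The sketch below runs both kinds of match against a small made-up document; the pattern "/el" is just an illustration.

```python
import re

from bs4 import BeautifulSoup

# Small made-up document for illustration.
html = (
    "<html><head><title>t</title></head><body><b>bold</b>"
    '<a href="http://example.com/elsie">Elsie</a>'
    '<a href="http://example.com/lacie">Lacie</a></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Pattern matched against tag names: names starting with b.
b_names = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Pattern matched against an attribute value: hrefs containing "/el".
el_links = [a.get_text() for a in soup.find_all("a", href=re.compile(r"/el"))]

print(b_names)
print(el_links)
```

The pattern is applied as a search, so "^b" has to anchor at the start of the name, while "/el" may match anywhere inside the href.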

# Get both the a tags and the b tags
soup.find_all(["a", "b"])
=> [<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Get tags by id value (e.g. the tag whose id is "link2")
soup.find_all(id='link2')
=> [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

# Get tags by the value of a specific attribute (e.g. the tag whose href attribute is "http://example.com/elsie")
soup.find_all(attrs={"href": "http://example.com/elsie"})
=> [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# Get tags that are p tags and whose class is "title"
soup.find_all("p", class_="title")
=> [<p class="title"><b>The Dormouse's story</b></p>]
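The filters above can be combined into one runnable sketch on a trimmed copy of the example document. It also uses find, which is not covered in this post: it is the Beautiful Soup method that takes the same filters as find_all but returns only the first match, or None when nothing matches.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the example document.
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="link2")
by_attr = soup.find_all(attrs={"href": "http://example.com/elsie"})
by_class = soup.find_all("p", class_="title")

# find: same filters, but a single tag (or None) instead of a list.
first_sister = soup.find("a", class_="sister")
no_match = soup.find("a", id="link9")

print(by_id, by_attr, by_class, first_sister, no_match)
```

The keyword is class_ with a trailing underscore because class is a reserved word in Python.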

Related posts

https://blog.naver.com/bb_/222619390736

[Python] Python Web Crawling Basics 2: Scrapy