Beautiful Soup4.md

Beautiful Soup: dates from the Python 2 era, but still very useful

1. Module

from bs4 import BeautifulSoup
from pathlib import Path
import requests
import re

2. Request URL

url = "https://www.google.com/"
html = requests.get(url).content

3. Parse with BeautifulSoup

soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
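
As a small offline sketch of this step: `from_encoding` only matters when the markup is passed as bytes (as `requests.get(url).content` is). The sample markup below is made up for illustration and stands in for a real response.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, encoded to bytes to mimic `requests.get(url).content`
raw = "<html><body><p>café</p></body></html>".encode("utf-8")

# Bytes input: from_encoding tells the parser how to decode it
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
paragraph_text = soup.p.string
print(paragraph_text)

# Already-decoded text: no from_encoding needed
text_soup = BeautifulSoup(raw.decode("utf-8"), "html.parser")
print(text_soup.p.string)
```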

4. Find the content you want (what's inside soup?)

Pretty print

print(soup.prettify())

Four object types

Tag

soup.p  # the first <p> tag; works for any tag name: title, head, a, ...
# Note: this returns only the first <p> tag, see 4.3
soup.p.name  # the tag's name, 'p'
soup.p.attrs  # the attributes of the <p> tag, as a dict
# e.g. {'class': ['title'], 'name': 'dromouse'}
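
A self-contained sketch of the Tag accessors above. The HTML snippet is invented so that it produces the attribute dict shown in the comment; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <p> tags
html = """
<p class="title" name="dromouse"><b>First paragraph</b></p>
<p class="story">Second paragraph</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                 # attribute access returns only the first <p>
print(first_p.name)              # p
print(first_p.attrs)             # dict of attributes; `class` is multi-valued, so it is a list
print(first_p["class"])          # index a tag like a dict to read one attribute
missing = first_p.get("id")      # .get() returns None instead of raising KeyError
print(missing)
```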

NavigableString

soup.p.string  # the string inside the <p> tag (None if the tag has more than one child)
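
One thing worth checking with `.string` is what happens when a tag holds more than plain text. A minimal sketch on made-up markup, using `get_text()` as the fallback:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> has a single text child, <div> has mixed children
html = "<p>Only text</p><div>Lead <b>bold</b> tail</div>"
soup = BeautifulSoup(html, "html.parser")

single = soup.p.string       # a NavigableString, because <p> has exactly one child
mixed = soup.div.string      # None, because <div> has several children
text = soup.div.get_text()   # joins every nested string instead

print(type(single).__name__, repr(single))
print(mixed)
print(text)
```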

BeautifulSoup

soup.name  # treat the whole soup as a tag; its name is '[document]'
soup.attrs
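
A quick sketch of what those two attributes look like on the top-level object, using a throwaway one-tag document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hi</p>", "html.parser")

root_name = soup.name     # the special name given to the whole document
root_attrs = soup.attrs   # the document itself carries no attributes
print(root_name)
print(root_attrs)
```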

Comment

A special kind of NavigableString (not very useful yet)

from bs4.element import Comment  # needed for the check below

if isinstance(soup.a.string, Comment):
    print(soup.a.string)
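
To see why the check matters, here is a sketch on made-up markup where one link contains only an HTML comment and another contains ordinary text. `.string` returns both without complaint, so the type is the only way to tell them apart.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: the first link holds a comment, the second plain text
html = "<a href='#'><!-- hidden note --></a><a href='#'>visible</a>"
soup = BeautifulSoup(html, "html.parser")

kinds = []
for link in soup.find_all("a"):
    kind = "comment" if isinstance(link.string, Comment) else "text"
    kinds.append(kind)
    print(kind, repr(str(link.string)))

# Collect every comment in the document with a function filter
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(len(comments))
```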

Navigating the tree

soup.p.contents  # list of all direct children of the <p> tag
soup.p.children  # the same children, but as an iterator
soup.p.descendants  # generator over all descendants of the <p> tag
# Also
soup.p.parent  # the parent node
soup.p.parents  # generator over all ancestors
# Also
soup.p.next_siblings  # generator over the following sibling nodes
soup.p.previous_siblings  # generator over the preceding sibling nodes
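
The navigation attributes above can be tried on a tiny tree. This is only a sketch: the markup is invented and written without whitespace between tags, so that every sibling is a tag rather than a stray newline string.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three sibling <p> tags inside one <div>
html = "<div><p>one <b>bold</b></p><p>two</p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
first = soup.p

children = first.contents                  # direct children: 'one ' and <b>bold</b>
descendants = list(first.descendants)      # also includes the 'bold' string inside <b>
parent_name = first.parent.name            # 'div'
ancestor_names = [a.name for a in first.parents]

next_texts = [s.get_text() for s in first.next_siblings]
last = soup.find_all("p")[-1]
prev_texts = [s.get_text() for s in last.previous_siblings]  # nearest sibling first

print(len(children), len(descendants))
print(parent_name, ancestor_names)
print(next_texts, prev_texts)
```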

Find_all

soup.find_all(['p', 'a'])  # all tags named 'p' or 'a'
soup.find_all(href=re.compile("ICSD"), string="ARC file")  # regular expressions work as filters; `string` is called `text` in older bs4 releases
# this returns: [<a href="./Helium (He) (ICSD 44396)_(PM7).html" target="_blank">ARC file</a>]
soup.find_all(id='links')  # all tags whose `id` attribute is 'links'
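
A runnable sketch of the three filters, on a made-up page built around the link shown above. The other links and the `links` container are assumptions added so that each filter has something to exclude.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the result shown above
html = """
<div id="links">
  <a href="./Helium (He) (ICSD 44396)_(PM7).html" target="_blank">ARC file</a>
  <a href="./Helium (He) (ICSD 44396)_(PM7).out">Output file</a>
  <a href="./readme.html">ARC file</a>
</div>
<p>Some text</p>
"""
soup = BeautifulSoup(html, "html.parser")

# A list of names matches any of them
both = soup.find_all(["p", "a"])

# Attribute filter (regex, matched with search) combined with a string filter
icsd_arc = soup.find_all("a", href=re.compile("ICSD"), string="ARC file")

# Keyword arguments filter on attributes of the same name
by_id = soup.find_all(id="links")

print(len(both))
print([a["href"] for a in icsd_arc])
print([t.name for t in by_id])
```

`find_all` always returns a list, so an empty result is `[]` rather than `None`; `soup.find(...)` is the variant that returns the first match or `None`.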