English Wikipedia: Beautiful Soup (HTML parser)


Beautiful Soup is a Python package for parsing HTML and XML documents, including those with malformed markup. It creates a parse tree for documents that can be used to extract data from HTML,[1] which is useful for web scraping.[2][3]

Beautiful Soup was started by Leonard Richardson, who continues to contribute to the project.[4] It is additionally supported by Tidelift, a paid subscription service for open-source maintenance.[5]

Code example

Beautiful Soup represents parsed data as a tree which can be searched and iterated over with ordinary Python loops.[6] The example below uses the Python standard library's urllib[7] to load Wikipedia's main page, then uses Beautiful Soup to parse the document and search for all links within.

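#!/usr/bin/env python3
# Anchor extraction from HTML document
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch the page; the response object is file-like and can be passed to the parser directly.
with urlopen('https://en.wikipedia.org/wiki/Main_Page') as response:
    # Build the parse tree with Python's built-in HTML parser.
    soup = BeautifulSoup(response, 'html.parser')
    # Visit every <a> element and print its href, or '/' if the attribute is missing.
    for anchor in soup.find_all('a'):
        print(anchor.get('href', '/'))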
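
The same search can be tried without network access, which also illustrates the handling of malformed markup mentioned in the introduction. The following is a minimal sketch rather than an excerpt from the Beautiful Soup documentation: the sample markup, the constant MALFORMED_HTML, and the helper extract_hrefs are hypothetical names introduced here for illustration. It assumes the beautifulsoup4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the <p> and <li> elements are never closed,
# and one anchor has no href attribute.
MALFORMED_HTML = """
<html><body>
<p>Links:
<ul>
<li><a href="/wiki/Python">Python</a>
<li><a href="/wiki/HTML">HTML</a>
<li><a name="no-href">anchor without href</a>
</ul>
</body></html>
"""


def extract_hrefs(markup):
    """Return the href of every <a> element that has one, in document order."""
    soup = BeautifulSoup(markup, "html.parser")
    # href=True restricts the search to anchors that carry an href attribute.
    return [anchor["href"] for anchor in soup.find_all("a", href=True)]


if __name__ == "__main__":
    for href in extract_hrefs(MALFORMED_HTML):
        print(href)
```

Filtering with href=True skips anchors that lack the attribute, whereas the example above substitutes a default value for them; which behaviour is preferable depends on the use case.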

History

Beautiful Soup is named after both a poem in Alice's Adventures in Wonderland[8] and the term "tag soup".[9]

Beautiful Soup 3 was the official release line of Beautiful Soup from May 2006 to March 2012. The current release is Beautiful Soup 4.x. Beautiful Soup 4 can be installed with pip install beautifulsoup4.

Python 2.7 support was retired in 2021; release 4.9.3 was the last to support it.[10]

References


  1. Template:Citation
  2. Citation error: invalid <ref> tag; no text was provided for the reference named crummy.com
  3. Template:Cite web
  4. Template:Cite web
  5. Template:Cite web
  6. Template:Cite web
  7. Template:Cite web
  8. Template:Cite web
  9. Template:Cite web
  10. Template:Cite web