BeautifulSoup for Scraping

By pjain      Published July 27, 2020, 8:03 p.m. in blog AI-Analytics-Data   

==== BS4 Quick -- DOM Interpret

QS: Using BS4 to extract

  • Simple fetch of HTML, strip script/style elements, and process as text

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()   # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    print(text)

Overview

see ~z/CMSScrape/BeautifulSoupEx/listurls.py

Setup and Quick Ex of soup object

  • Can work with Py2.7+

    $ pip install requests
    $ pip install beautifulsoup4

  • Extracting URLs from any website & making a soup object

    # Boilerplate to get a soup object
    from bs4 import BeautifulSoup
    import requests

    url = input("Website to extract the URL's from: ")
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data)

    # SAMPLE of filtering soup to extract URLs
    for link in soup.find_all('a'):
        print(link.get('href'))

r BS4, JSoup

  • https://github.com/vladv75/WebParsingServiceExample

BS4 HTML Filters Ref 101

  • Given a DOM tree, bs4 queries (find_xx) come back with results - this is essentially a query+filter process

    soup.a                   # first <a> tag
    soup.find_all('tag')     # matches all 'tag' elements, e.g. b, tr, input

  • Extracting URLs from any website

    for link in soup.find_all('a'):
        print(link.get('href'))

  • Search with a single class and with multiple classes

    soup.find_all("tr", {"class": "abc"})
    soup.find_all("tr", {"class": ["abc", "xyz"]})   # either class
    soup.select('div.A.B')   # CSS selector: requires BOTH .A and .B classes

  • <img src=...> - select the image tag itself with a CSS selector: find <img> tags that have a src attribute, inside a <div> with the class "preview"

    thumbnails = soup.select('div.preview img[src]')
    for thumbnail in thumbnails:
        url = thumbnail['src']

    # If you only need the first match, use select_one():
    url = soup.select_one('div.preview img[src]')['src']

Example HTML Parsing

# Legacy BeautifulSoup 3 / Python 2 example
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)
container = soup.find('td', {'class': 'bodytext'})
for hr_el in container.findAll('hr'):
    # <b>NAME</b><br/>ADDRESS_0<br/>ADDRESS_1<br/>Grading:<b>GRADE</b><hr/>
    text_parts = hr_el.findPreviousSiblings(text=True, limit=3)
    # ['Grading:', 'ADDRESS_1', 'ADDRESS_0']
    address = (text_parts[2], text_parts[1])
    el_parts = hr_el.findPreviousSiblings('b', limit=2)
    # [<b>GRADE</b>, <b>NAME</b>]
    grade = el_parts[0].string
    name = el_parts[1].string
    print name, address, grade
  • Fetch a page with urllib2 and list all link hrefs (legacy Python 2 / BS3)

    import urllib
    import urllib2
    from StringIO import StringIO
    from urllib2 import urlopen

    f = urlopen('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
    p = f.read()
    d = StringIO(p)
    f.close()

    from BeautifulSoup import *
    a = BeautifulSoup(d).findAll('a')
    [x['href'] for x in a]

    Faster Parsing

    p = SoupStrainer('a')
    a = BeautifulSoup(d, parseOnlyThese=p)
    [x['href'] for x in a]

  • Find an HTML node .. consider an HTML scrap like the following (a span of class "nobr" wrapping the temperature):

    <span class="nobr"><span>23</span>&nbsp;°C</span>

    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup(d)
    attrs = {'class': 'nobr'}            # list of matching criteria
    nobrs = soup.findAll(attrs=attrs)    # matching nodes
    temperature = nobrs[3].span.string   # why [3]? - that is the first one of interest
    print temperature                    # outputs 23

EX: HTML 2 TEXT - EX Quick

  • Convert fetched HTML to plain text with html2text

    s = gethtml(url)   # e.g. HTML fetched via requests / bs4

    h = html2text.HTML2Text()
    h.ignore_links = True
    h.ignore_images = True
    s = h.handle(s)
    s = h.unescape(s)

  • self.processor = html2text.HTML2Text()
    self.processor.ignore_emphasis = True
    self.processor.bypass_tables = True
    self.processor.ignore_links = True
    

def mk_plaintext(self):
    try:
        h = html2text.HTML2Text()
        h.ignore_images = True
        h.inline_links = False
        h.wrap_links = False
        h.unicode_snob = True          # prevents removing accents
        h.skip_internal_links = True
        h.ignore_anchors = True
        h.body_width = 0
        h.use_automatic_links = True
        h.ignore_tables = True
    except html.parser.HTMLParseError as e:
        raise WrongHTML(e)

    return h.handle(self.mk_html())

BS4 PKJ Experiments in PY Web Scraping 101

EX0: Ask user to select url, file, etc..

    #url = input("Website to extract the URL's from: ")

EX0: BS4 + html2text - fetch from http or file

  • html2text boilerplate, imports etc.
    # pip install requests beautifulsoup4 html2text

    # Following is boilerplate to get a soup object
    from bs4 import BeautifulSoup
    import requests
    import html2text
    import re
    import json

    # for a web url
    #r = requests.get(url)   # "http://" +
    #data = r.text

    # for a local file
    #file = "/Users/pjain/Google Drive/zToImport/HealthVImp/USNewsBest Hospitals for Cardiology & Heart Surgery _ US News Best Hospitals.htm"
    file = "usnewshospitalsdiabtes.html"
    data = open(file).read()
    soup = BeautifulSoup(data, "html.parser")

    # Extracting URLs from any website via the soup object
    links = []
    for link in soup.find_all('a'):
        links.append(link)
        # print(link.get('href'))

    print(links)

EX0: Homeless shelter - kind of works ..

    # url = 'http://www.suntopia.org/cupertino/ca/homeless_shelters.php?page=2'
    # container = soup.find('article')
    # for shelterdiv in container.findAll("div", {"id": "info2"}):
    #     print(shelterdiv.get_text(), end="\n\n\n\n")
    #     # Convert to text for easier parsing
    #     print(h.handle(shelterdiv.prettify()), end="=================\n")

EX1: Extract top hospitals from usnews for DIA, Cardiac apps

  • listurls - z/CMSScrapeText/pyScrapeExamples/01pkjscraping/listurls.py

  • Base working for cardiac - prettify worked right away!

    soup = BeautifulSoup(data, "html.parser")
    
    # Sample of HTML parsing 
    h = html2text.HTML2Text()
    # h.ignore_links = True
    h.ignore_images = True
    # print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))
    # html = open("foobar.html").read()
    # print(html2text.html2text(shelterdiv))
    

    EX3: usnews parse

    p = re.compile(r"\[(.*)\]\((.*)\)\n\n^(.+?)$\n\n.*$\n^([\d.]+)/", re.IGNORECASE)

    BASEURL = "https://health.usnews.com" divs = []; i=0 f = open("usnewsendocrinologydiabeteshospitals.json","w+") for div in soup.select('div.flex-row'): print(div.get_text(), end="========\n\n") addr1match= re.match(r'(.)\, ([A-Z]{2}) ([\d-]+).',pat1match.group(3)) #Orlando, FL 32806-2093 divadd = { "title" : '"' + pat1match.group(1).strip() + '"', "url" : '"' + BASEURL+re.sub('[ \n]','',pat1match.group(2)) + '"', "city" : '"' + addrmatch.group(1) + '"', "state" : '"' + addrmatch.group(2) + '"', "zipcode": '"' + addrmatch.group(3) + '"', "rating" : float(pat1match.group(4))*5.0/100.0 } divs.append(divadd)

    f.write(json.dumps(divs, indent=4, separators=(',', ': ')))   # sort_keys=True,
    f.close()

    """ BASEURL: https://health.usnews.com
    ###  [ North Shore University Hospital ](/best-hospitals/area/ny/north-shore-
    university-hospital-6212357)
    
    Manhasset, NY 11030-3816
    
    North Shore University Hospital in Manhasset, NY is not nationally ranked in
    any specialty.
    
    47.9/100
    
         Cardiology & Heart Surgery Score
    
    ===== MULTILINE, DOTALL
    /\[(.*)\]\((.*)\)\n\n^(.+?)$\n\n.*$\n^([\d\.]+)\//
    
    1 title:    
    2 url: baseurl+..
    3 address:  city, state, zip
    4 rating: x/100  map x/100 -> x/5
    
    description: 
    type: hospital/Cardiology & Heart Surgery Score
    
    """
    
  • DIA category - the basic prettify approach works, but it ends up as ~MANUAL extraction since there is too much junk in the prettified list

    p = re.compile(r"\[(.*)\]\((.*)\)\n\n^(.+?)$\n\n.*$\n^([\d.]+)/", re.IGNORECASE)

    BASEURL = "https://health.usnews.com" pat = r'.[(.)]((.))' # \s(.?+)? # patcompiled = re.compile(pat) addrpatcompiled = re.compile(r'(.)\, ([A-Z]{2}) ([\d-]+).') divs = []; i=0 f = open("usnewsendocrinologydiabeteshospitals.json","w+") for div in soup.select('div.flex-row'): text = h.handle(div.prettify()) # prettify - needed to clear out and generate specific layout as shown below .. print(text, end="========\n\n") #pat1match = re.match(pat, text, re.M|re.S) # (.+?) #.+?. ([\d.]+)\/) #print(pat1match) #print(text, "count:",pat1match.groups(), end="========\n\n")

JS in web pages & Dynamic/scrolling loading is a pain ..

  1. Solution: select the element in Google Chrome's DOM inspector and copy/export its outer HTML to a file, then parse that file (see the sketch below)
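    A minimal sketch of that workflow, assuming the copied outer HTML was saved to a hypothetical file saved_dom.html:

    from bs4 import BeautifulSoup

    # Parse the DOM snapshot exported from Chrome's inspector (right-click > Copy > Copy outerHTML)
    with open("saved_dom.html", encoding="utf8") as fh:
        soup = BeautifulSoup(fh.read(), "html.parser")

    # From here the usual bs4 queries work, e.g. collecting all link targets
    for link in soup.find_all('a'):
        print(link.get('href'))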

PKJ Finance BS4 Experiments - Finance.yahoo.com

  • Options Call Options Strike Symbol Last Chg Bid Ask Vol Open Int ... 190.0 AAPL130328C00350000 326.9 0 324.50 ..

  • Cmd line is good way to rapidly prototype ...

    from urllib.request import urlopen

    optionsUrl = 'http://finance.yahoo.com/q/op?s=AAPL+Options'
    optionsPage = urlopen(optionsUrl)

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(optionsPage)

    soup.findAll(text='AAPL130328C00350000')
    # [u'AAPL130328C00350000'] - EXACT match, used to track down the parent table!

    soup.findAll(text='AAPL130328C00350000')[0].parent
    # <a ...>AAPL130328C00350000</a> - the parent is the <a> (href); its parent is a <td>,
    # and that cell's .parent is the <tr> -> 110.00 AAPL130328C00350000 1.25 0.00 0.90 1.05 10 10

    optionsTable = [
        [x.text for x in y.parent.contents]   # gather all cells in each row
        for y in soup.findAll('td', attrs={'class': 'yfnc_h', 'nowrap': ''})
    ]

  • TIP: find a unique element: a td tag whose attrs have BOTH class 'yfnc_h' and an empty 'nowrap'. Each match y is a strike cell like 110.00; y.parent goes one level higher to the row, and we gather the contents of each row.

  • TIP: Defensive scraping - since the original article was written, the Yahoo implementation has changed COMPLETELY, including the URL, and is now React based!

    You should be careful if this is code you plan to reuse frequently. If Yahoo changes the way they format their HTML, this could stop working. If you plan to use code like this in an automated way, it is best to wrap it in a try/except block and validate the output (see the sketch below).
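    A minimal sketch of that defensive pattern, reusing the table-extraction code above (the validation rules are illustrative, not from the original article):

    def scrape_options_table(soup):
        try:
            rows = [
                [x.text for x in y.parent.contents]
                for y in soup.findAll('td', attrs={'class': 'yfnc_h', 'nowrap': ''})
            ]
        except (AttributeError, KeyError, TypeError) as e:
            raise ValueError("page layout changed - selector no longer matches") from e

        # Validate the output before trusting it downstream
        if not rows or any(len(r) == 0 for r in rows):
            raise ValueError("scraped table is empty or malformed")
        return rows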

Full BS EX and Lessons

BS4 Lessons

  • Lesson 1 - pagination is a nit .. here I was able to set the page size so everything fit on one page

  • Lesson 2 - The target page and the tutorial changed VERY FAST .. the tutorial quickly became stale

  • Lesson 3 - Lazy load is messy .. simplified by inspecting and saving the outer HTML of the <dl> containing the items

EX0: Snippets of HTML parsing

$ python
>>> from bs4 import BeautifulSoup
>>> html = """<p><img src="http://localhost:8080/images/foo.png" alt="foo"/> adfadsf</p>"""
>>> soup = BeautifulSoup(html, "html.parser")
  • img src

    >>> soup.find('img')['src']
    'http://localhost:8080/images/foo.png'

  • p text

    >>> soup.find('p').text
    ' adfadsf'


  • class selector filtering

    rank = soup.find("div", {"class": "rank-box"}).h6.contents   # e.g. ['19']

EX1: - scraping company list for CeCItem

  • ~DevM/PythonM/00_PKJt/33_TextBS4/pnpfintech.json

  • Note lazy load is messy .. simplified by inspecting and saving the outer HTML of the <dl> containing the company list

  • Imports and reading the saved HTML file

    from bs4 import BeautifulSoup
    import unicodedata
    import re

    infile = open("./pnphealth.html", "r", encoding="utf8")
    html = infile.read()
    soup = BeautifulSoup(html, "html.parser")
    # print(soup.prettify())
    tL = []

    # EXTRACT per dl >> li > img, h4, popover div
    dl = soup.find('dl', attrs={'class': 'plain-list'})
    # print(dl.prettify())
    for li in dl.findAll('li'):
        # img
        imgsrc = li.find('img', {'class': 'responsive'})['src']

        # company name, e.g. "BriteHealth"
        co = li.find('h4', {'class': 'text-center'}).text

        # POPOVER -- <div class="pnp-overlay-content pnp-popover-7123"> >> details url, article -> descr
        popover = li.find('div', {'class': 'pnp-overlay-content'})
        # url      <p><a href="http://britehealth.co">britehealth.co</a></p>
        url = popover.find('a')['href'] if popover.find('a') else ""
        # descr    <article class="pnp-popover-content-article">\n   Brite Health .. \n   </article>
        desc = popover.find('article').text
        desc = re.sub(r'\n *', '', desc)   # strip the newlines and indentation
        cL = {'imgsrc': imgsrc, 'co': co, 'url': url, 'description': desc}
        # print(cL)
        if len(cL) > 0:
            tL.append(cL)

    print(tL)

FullEX2: - scraping data table rows

~DevM/PythonM/00_PKJt/33_TextBS4/prisonscrape.py

SRC:https://first-web-scraper.readthedocs.io/en/latest/
$ pip install beautifulsoup4  # 4.4.1
$ pip install requests        # 2.9.1
  • analyze HTML https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s

    ...
    ADAM OMER SIRAJ M B 29 COLUMBIA MO  Details

  • Fetch the content of the web page

    import requests

    url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s'
    response = requests.get(url)
    html = response.content

    print(html)

  • Parse the HTML

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    print(soup.prettify())

    table = soup.find('tbody', attrs={'class': 'stripe'})
    print(table.prettify())

  • Extract the table rows and write them to CSV

    import unicodedata
    import csv

    tL = []
    for row in table.findAll('tr'):
        # print(row.prettify())
        cL = []
        for cell in row.findAll('td'):
            text = cell.text.replace('\xa0', '')        # strip non-breaking spaces
            text = unicodedata.normalize('NFKD', text)
            cL.append(text)
        # print(cL)
        if len(cL) > 0:
            tL.append(cL)
    # print(tL)

    outfile = open("./prisonscrape.csv", "w", encoding='utf8', newline='\n')
    writer = csv.writer(outfile)
    writer.writerows(tL)

Related modern example for table

# https://gist.github.com/phillipsm/0ed98b2585f0ada5a769
import requests
from bs4 import BeautifulSoup

url_to_scrape = 'http://apps2.polkcountyiowa.gov/inmatesontheweb/'

# Tell requests to retrieve the contents of our page (it'll be grabbing
# what you see when you use the View Source feature in your browser)
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")

inmates_list = []
for table_row in soup.select("table.inmatesList tr"):
    cells = table_row.findAll('td')
    if len(cells) > 0:
        first_name = cells[1].text.strip()
        last_name = cells[0].text.strip()
        age = cells[2].text.strip()
        inmate = {'first_name': first_name, 'last_name': last_name, 'age': age}
        inmates_list.append(inmate)
        print("Added {0} {1}, {2}, to the list".format(first_name, last_name, age))

# What if we want to do more than just print out all the names and
# ages? Maybe we want to filter things a bit. Say we only want to
# print out the inmates with an age between 20 and 30.
# Let's keep track of the number of inmates in their 20s.
inmates_in_20s_count = 0
for inmate in inmates_list:
    # The age we originally received from BeautifulSoup is a string.
    # We need it to be a number so that we can compare it easily. Let's make it an integer.
    age = int(inmate['age'])
    if age > 19 and age < 31:
        print("{0} {1}, {2} is in the 20 to 30 age range".format(inmate['first_name'], inmate['last_name'], age))
        inmates_in_20s_count = inmates_in_20s_count + 1

# How many inmates did we find in the page? Use the len function to find out.
print("Found {0} in the page".format(len(inmates_list)))
print("Found {0} between age 20 and 30 in the page".format(inmates_in_20s_count))

Dumping JSON

  • JSON is more rigorous than CSV

    import json
    print(json.dumps(tL))

Write CSV output

  • Given an array of dicts, can write it out with the csv module

    import csv

    fieldnames = ["imgsrc", "co", "url", "description"]
    outfile = open("./pnphealth.csv", "w", encoding='utf8', newline='\n')
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)

    writer.writeheader()

    writer.writerows(tL)

Content Examples - img, href

  • Get a representative (qualified) image for a linked post .. see how Reddit does it: https://github.com/reddit/reddit/blob/85f9cff3e2ab9bb8f19b96acd8da4ebacc079f04/r2/r2/lib/media.py (a simplified sketch follows below)
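    A simplified, illustrative sketch (not the Reddit implementation): prefer the og:image meta tag, falling back to the first <img> on the page.

    from bs4 import BeautifulSoup
    import requests

    def representative_image(url):
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        # Many sites declare a preferred preview image via Open Graph metadata
        og = soup.find("meta", attrs={"property": "og:image"})
        if og and og.get("content"):
            return og["content"]
        # Fall back to the first <img> tag that carries a src attribute
        img = soup.find("img", src=True)
        return img["src"] if img else None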

Pagination

The items I need to scrape are paginated. Sometimes the items you need aren't all available on one page. Luckily it's usually an easy fix: check the URL while changing between pages, and you'll usually find a page or offset parameter changing. This is easy to scrape with something like the following:

base_url = "http://some-url.com/something/?page=%s"
for url in [base_url % i for i in range(10)]:
    r = requests.get(url)

BS4 Ref

r BeautifulSoup Tool and Conceptual 101

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one.

In that case you just have to specify the original encoding, as sketched below.
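A minimal sketch, assuming a byte string in Latin-1 (from_encoding is a standard BeautifulSoup argument; the sample markup is illustrative):

    from bs4 import BeautifulSoup

    markup = b"<p>Se\xf1or, caf\xe9</p>"   # Latin-1 encoded bytes
    soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-1")
    print(soup.p.text)                     # -> Señor, café (decoded to Unicode)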

Beautiful Soup is a Python HTML/XML parser designed for quick-turnaround projects like screen-scraping. See crummy.com/software/BeautifulSoup/

GET data - Requests - fetching URL or file

  • You must provide (get) the HTML, either from a file on disk or by fetching a URL. Recommended to use requests instead of urllib(2) - see the sketch after this list.
  • SEE: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree
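    A minimal sketch of both input paths (the URL and filename are placeholders):

    from bs4 import BeautifulSoup
    import requests

    # 1. Fetch over HTTP with requests
    r = requests.get("https://example.com")
    r.raise_for_status()                       # fail loudly on 4xx/5xx
    soup_from_web = BeautifulSoup(r.text, "html.parser")

    # 2. Read a file already saved on disk
    with open("page.html", encoding="utf8") as fh:
        soup_from_file = BeautifulSoup(fh.read(), "html.parser")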

  • Going down: navigating using tag names, .contents and .children, .descendants, .string, .strings and .stripped_strings

  • Going up: .parent, .parents

  • Going sideways: .next_sibling and .previous_sibling, .next_siblings and .previous_siblings

  • Going back and forth: .next_element and .previous_element, .next_elements and .previous_elements (see the short demo below)
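A short demo of these navigation attributes on a tiny document (the HTML snippet is made up for illustration):

    from bs4 import BeautifulSoup

    html = "<table><tr><td>alpha</td><td>beta</td></tr></table>"
    soup = BeautifulSoup(html, "html.parser")

    first_td = soup.find('td')
    print(first_td.string)                     # 'alpha'      (going down)
    print(first_td.parent.name)                # 'tr'         (going up)
    print(first_td.next_sibling.string)        # 'beta'       (going sideways)
    print([t.name for t in soup.tr.children])  # ['td', 'td'] (children of the row)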

• Easy navigation & searching of tree

Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn't take much code to write an application.

Multiple Parsers Support

• Parses HTML-ish documents

Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.
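A small illustration of swapping parsers on the same markup (lxml and html5lib must be installed separately; html.parser ships with Python):

    from bs4 import BeautifulSoup

    broken = "<p>unclosed <b>tag"
    print(BeautifulSoup(broken, "html.parser").prettify())   # built-in, no extra install
    print(BeautifulSoup(broken, "lxml").prettify())          # fast; pip install lxml
    print(BeautifulSoup(broken, "html5lib").prettify())      # most lenient; pip install html5lib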

HIGH-LEVEL Text/Content extraction

  • Extracting all the text from a page

    print(soup.get_text())

  • Extracting an attribute from each matching tag, e.g. filtering soup to extract URLs

    for link in soup.find_all('a'):
        print(link.get('href'))

  • Accessing the content with get_text()

Regex

# Finds all tags whose names start with "b" - body, b
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

# Finds all tags whose names contain "t" - table, tr, td, th ..
for tag in soup.find_all(re.compile("t")):
    print(tag.name)

  • List (or): passing a list allows a string match against any item in that list.

    # Finds all 'a' and 'b' tags
    print(soup.find_all(["a", "b"]))

  • True - find all tags

    for tag in soup.find_all(True):
        print(tag.name)

  • User-defined function or regex as a filter

    # REGEX: find all divs whose id starts with "test-"
    t = soup.findAll('div', id=re.compile('^test-'))

    # WITH A FUNCTION (lambda)
    print(soupHandler.findAll('div', id=lambda x: x and x.startswith('test-')))

Selector based filtering

  • soup is the parsed html ..

    soup.find_all("title")
    soup.find_all("p", "title")
    soup.find_all("a")
    soup.find_all(id="link2")

  • filtering by class

    rt=soup.find('table', class_='wikitable sortable plainrowheaders')

Extract table to Data Frame

  • Extracting state capital names and attributes (image etc). The tree for this table is odd - the second column of each row is inside a <th> tag, not a <td> - so custom parsing is done. There are up to 6 <td> cells per row, so we make 6 lists for the cells, plus one extra for the state name.
    # "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"

    rt = soup.find('table', class_='wikitable sortable plainrowheaders')

    # Generate lists
    A = []; B = []; C = []; D = []; E = []; F = []; G = []
    for row in rt.findAll("tr"):
        cells = row.findAll('td')
        states = row.findAll('th')         # to store the second-column data
        if len(cells) == 6:                # only extract table body, not heading
            A.append(cells[0].find(text=True))
            B.append(states[0].find(text=True))
            C.append(cells[1].find(text=True))
            D.append(cells[2].find(text=True))
            E.append(cells[3].find(text=True))
            F.append(cells[4].find(text=True))
            G.append(cells[5].find(text=True))

    # import pandas to convert the lists to a data frame
    import pandas as pd

    df = pd.DataFrame(A, columns=['Number'])
    df['State/UT'] = B
    df['Admin_Capital'] = C
    df['Legislative_Capital'] = D
    df['Judiciary_Capital'] = E
    df['Year_Capital'] = F
    df['Former_Capital'] = G
    print(df)

Debugging soup. printing elements

print(soup.prettify())     # nested structure of the HTML page
print(soup.title)          # <title>Python For Beginners</title>
print(soup.title.string)   # u'Python For Beginners'

Short Sample - DiskCacheFetcher

# Python 2 example (md5 and urllib.urlopen are Python 2 era APIs)
import md5, os, tempfile, time, urllib

# fetcher = DiskCacheFetcher('/path/to/cache/directory')
# fetcher.fetch('http://developer.yahoo.com/', 60)

class DiskCacheFetcher:
    def __init__(self, cache_dir=None):
        # If no cache directory specified, use system temp directory
        if cache_dir is None:
            cache_dir = tempfile.gettempdir()
        self.cache_dir = cache_dir
    def fetch(self, url, max_age=0):
        # Use MD5 hash of the URL as the filename
        filename = md5.new(url).hexdigest()
        filepath = os.path.join(self.cache_dir, filename)
        if os.path.exists(filepath):
            if int(time.time()) - os.path.getmtime(filepath) < max_age:
                return open(filepath).read()
        # Retrieve over HTTP and cache, using rename to avoid collisions
        data = urllib.urlopen(url).read()
        fd, temppath = tempfile.mkstemp()
        fp = os.fdopen(fd, 'w')
        fp.write(data)
        fp.close()
        os.rename(temppath, filepath)
        return data
