Python Project: Web Scraper



Basic Python Project Setup



cd ~/py3project/code
mkdir bundlescraper
cd bundlescraper
touch bundlescraper.py
chmod +x bundlescraper.py


• Inside file bundlescraper.py
#!/usr/bin/env python3
print("Hello World")


# Run script
./bundlescraper.py


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Basic virtualenv Usage


• Refer to here for the initial installation & setup of a Python virtual environment

cd ~/py3project/code/bundlescraper
touch requirements.txt


virtualenv -p python3 venv              # initiate
source venv/bin/activate                # activate
# (venv) is prepended to the prompt
deactivate                              # exit


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Start programming our prototype!



• Put the following on the first line of file bundlescraper.py

#!/usr/bin/env python3


# Install libraries
# Instead of 'pip install requests', do this
echo "requests" >> requirements.txt
pip install -r requirements.txt


# Try running each line in the Python shell first
import requests
url = "https://www.humblebundle.com/books/cloud-computing-books"
r = requests.get(url)       # get contents
r.status_code               # expect 200 (an int)
r.__dict__                  # examine contents
r.text                      # html document
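
Once the shell experiment works, the same steps can be wrapped in a small function. This is only a sketch: the name fetch_html and the 10-second timeout are assumptions made for illustration, not part of the original notes.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML body of url, or raise if the request fails.

    fetch_html and the default timeout are hypothetical choices.
    """
    response = requests.get(url, timeout=timeout)  # timeout avoids hanging forever
    response.raise_for_status()                    # raises requests.HTTPError on 4xx/5xx
    return response.text
```

Calling fetch_html(url) with the url defined above should give the same string as r.text, with the difference that a failed request raises an exception instead of passing silently.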


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Parsing HTML with Python and Beautiful Soup (bs4)


• Make html human readable

# Install library bs4
# from the bash shell
echo "bs4" >> requirements.txt
pip install -r requirements.txt


# Try running each line in the Python shell first
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')   # r is the response from requests.get(url)
soup.title
soup.p
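
To see what "human readable" looks like without hitting the network, here is a minimal sketch on a tiny inline document. The HTML string is made up for illustration and stands in for r.text.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.text, so the example runs offline
html = (
    "<html><head><title>Demo Bundle</title></head>"
    "<body><p>First paragraph</p><p>Second paragraph</p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

print(soup.prettify())                # indented output, one tag per line

title_text = soup.title.string        # text inside <title>
first_paragraph = soup.p.get_text()   # soup.p is only the FIRST <p> in the document
all_paragraphs = [p.get_text() for p in soup.find_all("p")]
```

Note that soup.p returns just the first match; find_all("p") is needed to get every paragraph.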


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Inspecting DOM Elements


• In the browser, hover over the element of interest, right-click and select “Inspect Element”
• Get the id / class of the element we want
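
The id / class found in the inspector can then be used to look the element up with Beautiful Soup. A rough sketch follows; the markup, the id "tiers" and the classes "tier-name" and "product-name" are invented for illustration, and the real page will use different names.

```python
from bs4 import BeautifulSoup

# Made-up markup; the real page's ids and classes will differ
html = """
<div id="tiers">
  <div class="tier">
    <h2 class="tier-name">Pay $1 or more</h2>
    <span class="product-name">Book A</span>
    <span class="product-name">Book B</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

container = soup.find(id="tiers")                     # look up by id
headline = container.find("h2", class_="tier-name")   # tag name + class

# find_all returns every match; class_ has a trailing underscore
# because "class" is a reserved word in Python
names = [tag.get_text(strip=True)
         for tag in container.find_all(class_="product-name")]

# The same lookup written as a CSS selector
also_names = [tag.get_text(strip=True)
              for tag in soup.select("#tiers .product-name")]
```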

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Design Data Structure


• Useful to do once you know which data you want to extract

# Design Data Structure
# - tier 1 name and price
#     - product1
#     - product2
# - tier 2 name and price
#     - product1
#     - product2


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Common Data Structure Operations


• Try them, to see if you've got the right design

# Type this in an editor, then paste it into the Python shell
tiers = {
    "tier1":{
        "price":500,
        "products":[
            "name1",
            "name2"
        ]
    },
    "tier2":{
        "price":500,
        "products":[
            "name1",
            "name2"
        ]
    }
}


# In python shell
>>> tiers
>>> tiers.keys()
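
A few more operations worth trying against this design, as a sketch. The extra names "name3", "tier3" and "tier9" are hypothetical values added only to exercise the structure.

```python
tiers = {
    "tier1": {"price": 500, "products": ["name1", "name2"]},
    "tier2": {"price": 500, "products": ["name1", "name2"]},
}

# Read a nested value
tier1_price = tiers["tier1"]["price"]

# Add a product to an existing tier ("name3" is a made-up value)
tiers["tier1"]["products"].append("name3")

# Add a whole new tier ("tier3" is a made-up key)
tiers["tier3"] = {"price": 1500, "products": []}

# Walk every tier with its name and contents
for name, tier in tiers.items():
    print(name, tier["price"], tier["products"])

# .get() returns None instead of raising KeyError for a missing tier
missing = tiers.get("tier9")
```

If these operations feel awkward for what the scraper needs, that is a hint to revisit the design before writing the parsing code.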


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Python List Comprehensions



>>> [key.upper() for key in tiers.keys()]
['TIER1', 'TIER2']
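
The same idea extends to the nested parts of the structure. The sketch below uses slightly different sample data ("name3", "name4" and the 1500 price are made up) so that the filter has something to exclude.

```python
tiers = {
    "tier1": {"price": 500, "products": ["name1", "name2"]},
    "tier2": {"price": 1500, "products": ["name3", "name4"]},
}

# Iterating a dict directly yields its keys, so .keys() is optional
upper_keys = [key.upper() for key in tiers]

# Nested comprehension: flatten the products of every tier into one list
all_products = [product
                for tier in tiers.values()
                for product in tier["products"]]

# Dict comprehension: map tier name -> price
prices = {name: tier["price"] for name, tier in tiers.items()}

# Comprehension with a condition: tiers priced at 500 or less
cheap = [name for name, tier in tiers.items() if tier["price"] <= 500]
```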

