Python Project: Web Scraper
Basic Python Project Setup
cd ~/py3project/code
mkdir bundlescraper
cd bundlescraper
touch bundlescraper.py
chmod +x bundlescraper.py
• Inside file bundlescraper.py
#!/usr/bin/env python3
print("Hello World")
# Run script
./bundlescraper.py
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Basic virtualenv Usage
• Refer here for the initial installation & setup of a Python virtual environment
cd ~/py3project/code/bundlescraper
touch requirements.txt
virtualenv -p python3 venv # initiate
source venv/bin/activate # activate
# (venv) is prepended to the prompt
deactivate # exit
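
Besides looking for the (venv) prefix in the prompt, the interpreter itself can report whether it is running inside a virtual environment. The snippet below is a minimal sketch; `in_virtualenv` is a hypothetical helper name, not part of virtualenv.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; newer releases (and the
    # stdlib venv module) make sys.prefix differ from sys.base_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("inside virtualenv:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```

Run it once after `source venv/bin/activate` and once after `deactivate`; the first line of output should change between the two runs.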
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Start programming our prototype!
• Put the following in the first line of file bundlescraper.py
# Install libraries
# Instead of 'pip install requests', do this
echo "requests" >> requirements.txt
pip install -r requirements.txt
# Try running each line in the Python shell first
import requests
url = "https://www.humblebundle.com/books/cloud-computing-books"
r = requests.get(url) # get contents
r.status_code # expect 200 (an int)
r.__dict__ # examine contents
r.text # html document
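
Once the shell experiment works, the same calls could be wrapped in a small function for the script. This is only a sketch: `fetch_html` and the 10-second default timeout are assumptions, not something the steps above prescribe. It relies on `requests.get(..., timeout=...)`, `Response.raise_for_status()` and `requests.RequestException`.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the page body as text, or None if the request fails.

    Hypothetical helper; the timeout value is an arbitrary choice.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx answers into exceptions instead of checking
        # status_code by hand.
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f"request failed: {exc}")
        return None
    return response.text
```

Calling `fetch_html(url)` with the bundle URL from above should give the same HTML as `r.text`, while a malformed URL or a network error gives `None` instead of a traceback.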
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Parsing HTML with Python and Beautiful Soup (bs4)
• Make html human readable
# Install library bs4
# from bash shell
echo "bs4" >> requirements.txt
pip install -r requirements.txt
# Try running each line in the Python shell first
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser') # r is the response from requests.get(url)
soup.title
soup.p
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Inspecting DOM Elements
• In the browser, hover over the element of interest, right-click and select “Inspect Element”
• Get the id / class of the element we want
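
Once the id or class is known, Beautiful Soup can look the element up directly. The sketch below runs against a made-up HTML snippet: the id `tier-list` and the classes `tier-name` and `product-name` are hypothetical, so replace them with whatever the inspector shows on the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page's ids and class names will differ.
html = """
<div id="tier-list">
  <div class="tier">
    <h2 class="tier-name">Pay $1 or more</h2>
    <span class="product-name">Book A</span>
    <span class="product-name">Book B</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Look up a single element by its id attribute.
tier_list = soup.find(id="tier-list")

# Look up elements by class. The keyword is class_ (trailing underscore)
# because "class" is a reserved word in Python.
heading = tier_list.find(class_="tier-name").get_text(strip=True)
names = [tag.get_text(strip=True)
         for tag in tier_list.find_all(class_="product-name")]

print(heading)
print(names)
```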
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Design Data Structure
• Useful to do once you know what you want
# Design Data Structure
# - tier 1 name and price
#   - product1
#   - product2
# - tier 2 name and price
#   - product1
#   - product2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Common Data Structure Operations
• Try them, to see if you've got the right design
# Write it in an editor, then paste it into the Python shell
tiers = {
    "tier1": {
        "price": 500,
        "products": [
            "name1",
            "name2"
        ]
    },
    "tier2": {
        "price": 500,
        "products": [
            "name1",
            "name2"
        ]
    }
}
# In python shell
>>> tiers
>>> tiers.keys()
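
A few more operations help check that the design fits what the scraper needs to do: reading a nested value, adding a product, adding a tier, and walking the whole structure. This is a sketch on placeholder data; the names and prices below are not real bundle contents.

```python
# Same shape as the tiers dictionary; names and prices are placeholders.
tiers = {
    "tier1": {"price": 500, "products": ["name1", "name2"]},
    "tier2": {"price": 500, "products": ["name1", "name2"]},
}

# Read a nested value.
tier1_price = tiers["tier1"]["price"]

# Add a product to an existing tier.
tiers["tier2"]["products"].append("name3")

# Add a whole new tier.
tiers["tier3"] = {"price": 1500, "products": []}

# Walk the structure in the same order as the design outline.
for tier_name, tier in tiers.items():
    print(f"{tier_name} (price {tier['price']})")
    for product in tier["products"]:
        print(f"  - {product}")

# Use .get() when a tier may be missing, to avoid a KeyError.
missing = tiers.get("tier9")
```

If any of these operations feels awkward, that is a hint to revisit the design before writing the parsing code.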
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Python List Comprehensions
>>> [key.upper() for key in tiers.keys()]
['TIER1', 'TIER2']
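
The same syntax extends to dict comprehensions, nested loops and filters, which are likely to be useful when summarising scraped tiers. A sketch on placeholder data (the prices and the budget of 1000 are arbitrary assumptions):

```python
# Placeholder data in the same shape as the tiers dictionary.
tiers = {
    "tier1": {"price": 500, "products": ["name1", "name2"]},
    "tier2": {"price": 1500, "products": ["name3"]},
}

# Dict comprehension: map each tier name to its price.
prices = {name: tier["price"] for name, tier in tiers.items()}

# Nested comprehension: flatten every tier's products into one list.
all_products = [p for tier in tiers.values() for p in tier["products"]]

# Filtered comprehension: keep only the tiers at or below a budget.
affordable = [name for name, tier in tiers.items() if tier["price"] <= 1000]

print(prices)
print(all_products)
print(affordable)
```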