Project 1: Web Scraper
Objectives:
- Gather data from a web page using the requests library
- Extract specific information from HTML using the BeautifulSoup library
- Store extracted data in a structured format
Introduction:
Web scraping is the process of extracting data from websites. It involves sending HTTP requests to retrieve HTML content and then parsing the HTML to extract the desired information. Python provides powerful libraries for web scraping, making it a convenient tool for data collection from the web.
Step 1: Import Libraries:
If the libraries are not installed yet, run `pip install requests beautifulsoup4` first.
```python
import requests
from bs4 import BeautifulSoup
```
Step 2: Send Request and Retrieve HTML:
```python
url = "https://www.example.com/data.html"
response = requests.get(url)
html_content = response.content
```
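In practice the request can fail: DNS errors, timeouts, or a non-200 status. Below is a sketch of defensive fetching, written with the standard library's urllib.request so it runs without third-party packages; with requests you would instead pass a `timeout=` argument and call `response.raise_for_status()` inside a try/except.

```python
from urllib.request import urlopen
from urllib.error import URLError

def fetch_html(url, timeout=10):
    """Fetch a URL and return its decoded body, or None on failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (URLError, ValueError) as exc:  # ValueError covers malformed URLs
        print(f"Failed to fetch {url}: {exc}")
        return None
```

Returning None (rather than letting the exception propagate) lets the caller decide how to recover, as the longer example at the end of this project does.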
Step 3: Parse HTML with BeautifulSoup:
```python
soup = BeautifulSoup(html_content, "lxml")
```
Note that "lxml" is a third-party parser (`pip install lxml`); Python's built-in "html.parser" also works here, just more slowly.
Step 4: Extract Information:
Use BeautifulSoup’s methods to navigate the HTML structure and extract the desired information. For instance, to extract text from the first <p> element:
```python
first_paragraph = soup.find('p').text  # caution: find() returns None if no <p> exists
print(first_paragraph)
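To make this step concrete without any third-party dependency, the same "text of the first <p>" extraction can be sketched with the standard library's html.parser module. This is an illustration of what the parsing step does, not a replacement for BeautifulSoup:

```python
from html.parser import HTMLParser

class FirstParagraphExtractor(HTMLParser):
    """Collects the text of the first <p> element encountered."""
    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while inside the first <p>
        self.done = False       # True once the first <p> has closed
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and not self.done:
            self.text_parts.append(data)

parser = FirstParagraphExtractor()
parser.feed("<html><body><p>First!</p><p>Second</p></body></html>")
print("".join(parser.text_parts))  # First!
```

The bookkeeping above (tracking open tags, deciding when text is "inside" an element) is exactly what BeautifulSoup hides behind `soup.find('p').text`.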
Step 5: Store Extracted Data:
Save the extracted data in a structured format, such as a list or a dictionary. For instance, to store all paragraph texts in a list:
```python
paragraphs = []
for paragraph in soup.find_all('p'):
    paragraphs.append(paragraph.text)
print(paragraphs)
```
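A list works for quick experiments; for data you want to keep, a structured file is more durable. A minimal sketch that writes the paragraphs to CSV with the standard csv module — `io.StringIO` stands in for a real file here; swap in `open('paragraphs.csv', 'w', newline='')` to write to disk (the filename is just an example):

```python
import csv
import io

paragraphs = ["First paragraph.", "Second paragraph."]

buffer = io.StringIO()                       # in-memory stand-in for a file
writer = csv.writer(buffer)
writer.writerow(["index", "text"])           # header row
for i, text in enumerate(paragraphs, start=1):
    writer.writerow([i, text])

print(buffer.getvalue())
```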
Additional Considerations:
- Handle errors and exceptions gracefully.
- Respect the website’s robots.txt file and terms of service, and throttle your requests to avoid overloading the server.
- Use a proxy or VPN to mask your IP address and avoid being blocked.
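The robots.txt check from the list above can be automated with the standard library's urllib.robotparser. A minimal sketch — the robots.txt body is inlined here for illustration; in practice you would point `set_url()` at the site's real /robots.txt and call `read()`:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt body; real sites serve this at /robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Ask before you fetch: user agent first, then the full URL.
print(rp.can_fetch("*", "https://www.example.com/data.html"))   # True
print(rp.can_fetch("*", "https://www.example.com/private/x"))   # False
```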
Summary:
Web scraping is a valuable technique for gathering data from the web. By utilizing Python libraries like requests and BeautifulSoup, you can effectively extract information from websites and store it in a structured format.
Example:
Let’s consider a simple example where we build a web scraper to extract headlines from a news website.
```python
# Example Code
import requests
from bs4 import BeautifulSoup

def fetch_web_page(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        print(f"Failed to fetch the web page. Status Code: {response.status_code}")
        return None

def parse_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    return soup

def extract_headlines(soup):
    headlines = soup.select('.headline')  # Example CSS selector
    return [headline.text.strip() for headline in headlines]

def main():
    target_url = 'https://example-news-website.com'
    html_content = fetch_web_page(target_url)
    if html_content:
        soup = parse_html(html_content)
        headlines = extract_headlines(soup)
        for index, headline in enumerate(headlines, start=1):
            print(f"{index}. {headline}")

if __name__ == "__main__":
    main()
```