Project 1: Web Scraper

Objectives:

  • Gather data from a web page using the requests library
  • Extract specific information from HTML using the BeautifulSoup library
  • Store extracted data in a structured format

Introduction:

Web scraping is the process of extracting data from websites. It involves sending HTTP requests to retrieve HTML content and then parsing the HTML to extract the desired information. Python provides powerful libraries for web scraping, making it a convenient tool for data collection from the web.

Step 1: Import Libraries:

Python

import requests
from bs4 import BeautifulSoup

Step 2: Send Request and Retrieve HTML:

Python

url = "https://www.example.com/data.html"
response = requests.get(url)
html_content = response.content
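
In practice it helps to set a timeout and check the response status before parsing. A minimal sketch, with a helper name (fetch) of our own choosing rather than anything from the requests API:

```python
import requests

def fetch(url, timeout=10):
    """Return the page's HTML text, or None on any request failure."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raises for 4xx/5xx responses
        return response.text
    except requests.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
```

raise_for_status() turns HTTP error codes into exceptions, so a single except clause covers connection failures, timeouts, and error responses alike.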

Step 3: Parse HTML with BeautifulSoup:

Python

soup = BeautifulSoup(html_content, "lxml")  # or "html.parser" if lxml is not installed

Step 4: Extract Information:

Use BeautifulSoup’s methods to navigate the HTML structure and extract the desired information. For instance, to extract text from the first <p> element:

Python

first_paragraph = soup.find('p').text
print(first_paragraph)
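
Note that find() returns None when no match exists, so it is safer to check the result before reading .text. A small sketch on an inline HTML snippet (the markup here is invented for illustration):

```python
from bs4 import BeautifulSoup

# Inline HTML invented for illustration
html = """
<html><body>
  <p class="intro">Welcome to the site.</p>
  <p>Second paragraph.</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns None when nothing matches, so guard before reading .text
first = soup.find("p")
first_text = first.get_text(strip=True) if first else ""
print(first_text)

# Attributes narrow the search, e.g. matching a CSS class
intro = soup.find("p", class_="intro")
```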

Step 5: Store Extracted Data:

Save the extracted data in a structured format, such as a list or a dictionary. For instance, to store all paragraph texts in a list:

Python

paragraphs = []
for paragraph in soup.find_all('p'):
    paragraphs.append(paragraph.text)

print(paragraphs)
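
Beyond a flat list, each extracted item can be stored as a dictionary and written out as CSV. A sketch using an inline HTML snippet (the markup is illustrative, and the in-memory buffer stands in for a real file):

```python
import csv
import io
from bs4 import BeautifulSoup

# Inline HTML invented for illustration
html = "<p>Alpha</p><p>Beta</p>"
soup = BeautifulSoup(html, "html.parser")

# One dictionary per paragraph keeps the data self-describing
rows = [
    {"index": i, "text": p.get_text(strip=True)}
    for i, p in enumerate(soup.find_all("p"), start=1)
]

# Write to CSV (StringIO here; swap in open("out.csv", "w", newline=""))
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["index", "text"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```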

Additional Considerations:

  • Handle errors and exceptions gracefully (e.g., timeouts, non-200 responses, missing elements).
  • Respect the website’s robots.txt rules, and rate-limit your requests to avoid overloading the server.
  • Be aware that some sites block scrapers; check the site’s terms of service before working around blocks with a proxy or VPN.
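
The robots.txt check can be done programmatically with Python’s standard-library urllib.robotparser. This sketch parses an example rules file directly; the rules and URLs are hypothetical, and a real scraper would fetch robots.txt from the target site first:

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice, fetch it from
# https://<site>/robots.txt before scraping.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("*", "https://example.com/data.html"))           # allowed
print(parser.can_fetch("*", "https://example.com/private/report.html")) # disallowed

# Rate-limit between requests; a short delay here for demonstration,
# real scrapers typically wait a second or more.
time.sleep(0.1)
```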

Summary:

Web scraping is a valuable technique for gathering data from the web. By utilizing Python libraries like requests and BeautifulSoup, you can effectively extract information from websites and store it in a structured format.

Example:

Let’s consider a simple example where we build a web scraper to extract headlines from a news website.

# Example Code
import requests
from bs4 import BeautifulSoup

def fetch_web_page(url):
    # A timeout keeps a hung connection from stalling the scraper
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException as exc:
        print(f"Failed to fetch the web page: {exc}")
        return None
    if response.status_code == 200:
        return response.text
    print(f"Failed to fetch the web page. Status Code: {response.status_code}")
    return None

def parse_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    return soup

def extract_headlines(soup):
    headlines = soup.select('.headline')  # Example CSS selector
    return [headline.text.strip() for headline in headlines]

def main():
    target_url = 'https://example-news-website.com'
    html_content = fetch_web_page(target_url)

    if html_content:
        soup = parse_html(html_content)
        headlines = extract_headlines(soup)

        for index, headline in enumerate(headlines, start=1):
            print(f"{index}. {headline}")

if __name__ == "__main__":
    main()