How to scrape Weather.com for Weather Information using BeautifulSoup

A short and handy tutorial to build your own web scraper with Apify


Are you planning to go out this summer? Picking a date for the perfect summer day out can be difficult, but a simple analysis of the weather forecast for upcoming dates may save you the hassle. In this tutorial, we will scrape and extract data from the famous weather forecasting website weather.com using Python and the BeautifulSoup library, and deploy the scraper to the Apify platform.

What is Apify?

Apify is a web scraping platform that provides SDKs and pre-built templates for common web scraping tasks. It also provides convenient storage functions, making it easy to handle scraped data.

Prerequisites

  • Python 3.9+

  • Basic knowledge of web scraping.

Setting Up

Even though we can write our web scraper from scratch using Apify SDK, Apify has a few pre-built templates for an easier start. They include every file we need to scrape a web page and are very easy to customize, which helps us save time and prevent headaches.

These templates also include the Actor lifecycle management feature, which helps us focus on the core logic of the scraper without worrying much about setup and teardown. We can also use the template’s pre-built input schema file to define and easily validate a schema for our Actor’s input.

So here, we’ll be using the Apify Python starter template, which uses BeautifulSoup to scrape a single page at the input URL.


While we can use the Apify console itself to edit, build, and run the code, it would be more convenient for us to write the program locally and then deploy it to Apify.

To use the template locally, first install the Apify CLI, then create a new Actor from the starter template (the CLI will prompt you to choose a template):

$ npm -g install apify-cli
$ apify create my-actor

Running these commands will scaffold the Actor, build a virtual environment for it, and install the required dependencies.

When it’s all set up, we can see two Python scripts, main.py and __main__.py. The functions of these two scripts are as follows:

  • main.py: This file contains the main function of the Apify Actor. It’s where the web scraping task is defined and executed.

  • __main__.py: This file is the entry point for executing the Apify Actor. It sets up the necessary logging configuration and runs the main function from main.py asynchronously.

What is an Actor?

Just like a human actor, Apify Actors accept input, perform their job, and generate output, all according to the script.

Understanding the code

Let’s take a closer look at the scripts of our newly created actor.

main.py

# Beautiful Soup - library for pulling data out of HTML and XML files, read more at
# https://www.crummy.com/software/BeautifulSoup/bs4/doc
from bs4 import BeautifulSoup

# HTTPX - library for making asynchronous HTTP requests in Python, read more at https://www.python-httpx.org/
from httpx import AsyncClient

# Apify SDK - toolkit for building Apify Actors, read more at https://docs.apify.com/sdk/python
from apify import Actor


async def main() -> None:
    async with Actor:
        # Structure of input is defined in input_schema.json
        actor_input = await Actor.get_input() or {}
        url = actor_input.get('url')

        # Create an asynchronous HTTPX client
        async with AsyncClient() as client:
            # Fetch the HTML content of the page.
            response = await client.get(url, follow_redirects=True)

        # Parse the HTML content using Beautiful Soup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract all headings from the page (tag name and text)
        headings = []
        for heading in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
            heading_object = {'level': heading.name, 'text': heading.text}
            Actor.log.info(f'Extracted heading: {heading_object}')
            headings.append(heading_object)

        # Save headings to Dataset - a table-like storage
        await Actor.push_data(headings)

This script uses the following libraries:

  1. BeautifulSoup: A popular Python web scraping library that extracts data from HTML and XML in a hierarchical, readable manner.

  2. HTTPX: An HTTP client for Python 3 that provides both sync and async APIs. Its AsyncClient is used to make asynchronous HTTP requests, which means a batch of requests doesn’t wait for each request to finish before making the next (see the short sketch after this list).

  3. Apify SDK: A toolkit for building Apify Actors, which also provides access to the platform’s storage features.
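To see the async behavior in isolation, here is a minimal, standalone sketch (not part of the Actor code) that fetches two pages concurrently; the URLs are just placeholders:

import asyncio

from httpx import AsyncClient


async def fetch_all(urls: list[str]) -> list[int]:
    async with AsyncClient() as client:
        # gather() starts all requests at once instead of one after another
        responses = await asyncio.gather(
            *(client.get(url, follow_redirects=True) for url in urls)
        )
    return [response.status_code for response in responses]


# Placeholder URLs - swap in any pages you like
print(asyncio.run(fetch_all(['https://example.com', 'https://apify.com'])))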

The sample main() function fetches the HTML content of the input URL, parses it using BeautifulSoup, extracts the web page’s headings, and finally stores them in a dataset.

The storage function raises an important point about deploying these web scrapers to Apify. When run locally, each item the scraper pushes to the dataset is stored as an individual JSON file, which means that if there are 20 output items, there will be 20 JSON files. This makes it cumbersome to gather the data for analysis.
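If you do run locally, a small sketch like the one below can stitch the items back together. The storage path is the Python SDK’s default local layout, so confirm it against your own project:

import glob
import json

# Each dataset item is written as its own JSON file on local runs
paths = sorted(glob.glob('storage/datasets/default/*.json'))
items = [json.load(open(path, encoding='utf-8')) for path in paths]
print(f'Loaded {len(items)} items')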

But when the web scraper is deployed and run on Apify, the dataset is stored in the cloud and can be accessed and downloaded through the Apify app or its API. The data can be downloaded in different formats such as CSV, JSON, Excel, and even as an RSS feed, whichever fits our requirements. Therefore, once we finish creating the program, we will be deploying it to Apify.
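For example, after a run on the platform, the whole dataset can be downloaded as CSV through Apify’s REST API; the dataset ID and API token below are placeholders you would substitute with your own:

import httpx

# Placeholders - substitute your own dataset ID and API token
DATASET_ID = '<DATASET_ID>'
API_TOKEN = '<YOUR_API_TOKEN>'

url = f'https://api.apify.com/v2/datasets/{DATASET_ID}/items'
response = httpx.get(url, params={'format': 'csv', 'token': API_TOKEN})

with open('weather.csv', 'w', encoding='utf-8') as f:
    f.write(response.text)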

__main__.py

import asyncio
import logging

from apify.log import ActorLogFormatter

from .main import main

# Configure loggers
handler = logging.StreamHandler()
handler.setFormatter(ActorLogFormatter())

apify_client_logger = logging.getLogger('apify_client')
apify_client_logger.setLevel(logging.INFO)
apify_client_logger.addHandler(handler)

apify_logger = logging.getLogger('apify')
apify_logger.setLevel(logging.DEBUG)
apify_logger.addHandler(handler)

# Execute the Actor's main coroutine
asyncio.run(main())

This script uses the standard asyncio and logging libraries to run the asynchronous main function and to configure the logging settings for the Apify Actor, respectively.

In summary, this script configures two loggers, apify and apify_client, and runs the main() coroutine from main.py asynchronously, which allows the Actor to perform multiple tasks concurrently.

Customizing the Code

Now that we have a basic understanding of the code in the template, we can customize it according to our needs.

As mentioned in the first part of the tutorial, we will be scraping data from weather.com to extract weather information for roughly the next 10 days. To do so, we should first edit the input URL in the input_schema.json file to the URL of weather.com’s 10-Day Weather page.

Image: input_schema.json file

(To find the URL, go to weather.com > 10-Day Weather. The URL depends on the location of your IP address; if you want to extract weather information for a specific place, enter its zip code on the website and copy the resulting URL.)
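For reference, the edited schema could look roughly like this. The structure follows Apify’s input schema format, but your template version may differ slightly, and <locationId> in the prefill URL is a placeholder for the location code weather.com puts in the address bar:

{
    "title": "Scrape weather.com's 10-day forecast",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "url": {
            "title": "URL of the page",
            "type": "string",
            "editor": "textfield",
            "prefill": "https://weather.com/weather/tenday/l/<locationId>"
        }
    },
    "required": ["url"]
}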

Then we should edit our main.py file to extract data using the relevant class names. For this tutorial, we will extract the weather condition and the temperature.

If you take a close look at the main.py file in the template, you can see how headings are extracted by HTML tag name. But since we are looking for specific pieces of information, tag names alone won’t do; instead, we should use the class names assigned to those elements.

Here’s what the customized code looks like:

from bs4 import BeautifulSoup
from httpx import AsyncClient
from apify import Actor

async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}
        url = actor_input.get('url')

        async with AsyncClient() as client:
            response = await client.get(url, follow_redirects=True)

        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract weather data for 10 days
        weather_data = []
        for day in soup.find_all('div', class_='Disclosure--SummaryDefault--2XBO9'):
            day_data = {}
            # Extract day name
            day_name = day.find(attrs={"data-testid": "daypartName"})
            day_data['day_name'] = day_name.text if day_name else "N/A"

            # Extract weather condition
            weather = day.find('span', class_='DetailsSummary--extendedData--307Ax')
            day_data['weather'] = weather.text if weather else "N/A"
            # Extract temperature
            temperature = day.find('span', class_='DetailsSummary--highTempValue--3PjlX')
            day_data['temperature'] = temperature.text if temperature else "N/A"

            Actor.log.info(f'Extracted weather data: {day_data}')
            weather_data.append(day_data)

        # Save weather data to Dataset
        await Actor.push_data(weather_data)

Modifications done to the template:

  • Data is extracted via class names and stored in the list weather_data.

  • The dataset will have three columns: day name, weather, and temperature. If needed, you can extract wind speed, UV index, humidity, etc. in the same manner (see the sketch after this list).

  • If a piece of information doesn’t exist, that data field is filled with “N/A” to prevent potential errors and to ease data cleaning during analysis.
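As a sketch of such an extension, lines like the following could go inside the day loop, right after the temperature lines. The selectors are hypothetical placeholders: weather.com’s class names are generated and change over time, so inspect the live page in your browser’s DevTools to find the current ones:

            # Hypothetical selectors - verify the real class names in DevTools,
            # since weather.com's generated CSS classes change between releases
            wind = day.find('span', class_='Wind--windWrapper-PLACEHOLDER')
            day_data['wind'] = wind.text if wind else "N/A"

            uv_index = day.find('span', class_='UVIndex--uvValue-PLACEHOLDER')
            day_data['uv_index'] = uv_index.text if uv_index else "N/A"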

Now our program is complete. We can move to the next step by deploying it to Apify.
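Before deploying, you can also test the Actor locally with the Apify CLI, which runs it against your local storage:

$ apify run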

Deploying the code to Apify

To deploy and execute this code on Apify, you should create an Apify account first and get your default API token for logging in.

  • Go to apify.com > sign up and create a free account.

  • In the Apify console, go to Settings > Integrations and copy your default API token.

Once you have the token, type apify login in the terminal and paste your token when prompted.

C:\Users\user\my-actor>apify login
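If you prefer a non-interactive login, the CLI can also take the token as a flag (run apify login --help to confirm the exact option in your version):

$ apify login -t <YOUR_API_TOKEN>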

After signing in, enter apify push on the terminal and we’re good to go!

Success: you are logged in to Apify as buzzpy!
C:\Users\nethm\my-actor>apify push

To view and run our deployed code, go to console.apify.com > Actors, where you can see the deployed Actor. To run it, click the “Start” button and let Apify do the rest.

Image: Output of the actor

Now you can easily download the output data, clean out the “N/A” values, and analyze it.
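For instance, here is a minimal pandas sketch, assuming you exported the dataset as weather.csv with the columns day_name, weather, and temperature:

import pandas as pd

# Load the exported dataset and drop rows where scraping returned "N/A"
df = pd.read_csv('weather.csv')
df = df.replace('N/A', pd.NA).dropna()
print(df.head())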

Next Steps

And that’s it! Read the Apify Python SDK documentation to expand your knowledge of Python web scraping with Apify, and if you want to try out more, read this tutorial on scraping LinkedIn on the Apify blog.