When using Web Scraper to collect data, we often find that scraping pages one at a time is time-consuming and inefficient. Scraping data across multiple pages in a single run can significantly improve efficiency. This article explains how to use Web Scraper to scrape multi-page data, hoping to provide some inspiration.
There are several methods for scraping multi-page data: using pagination navigation, using automated pagination looping, and calling API interfaces. Among these, pagination navigation is the most common, while automated pagination looping is the most recommended. Below, we walk through the steps for each method.
Using Pagination Navigation
Pagination navigation is the most common method. Websites use pagination to display large amounts of data, with each page having a unique URL. By analyzing these URLs to identify the pattern, we can use a for loop to iterate from the first page to the last, generate and request each URL, and extract the corresponding data. Some web pages do not display specific page numbers and instead require clicking a "Next" button to navigate. In that case, we locate the button in the HTML, simulate user clicks, and loop until the button becomes unclickable or disappears, which indicates the last page.
Below is a snippet of the core logic (Python):
Python
import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/products?page="
max_pages = 5  # Set the maximum number of pages to crawl

for page in range(1, max_pages + 1):
    url = f"{base_url}{page}"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    items = soup.select('.product-item')
    for item in items:
        title = item.select_one('.title').text.strip()
        price = item.select_one('.price').text.strip()
        print(f"Page {page}: {title} - {price}")
Using Automated Pagination Looping
Automated pagination looping is the most efficient way to collect data. Simply put, it involves using scripts or automation tools to implement automatic page-turning for data retrieval. Below is an example using Puppeteer to achieve this functionality.
JavaScript (Node.js)
const puppeteer = require('puppeteer');

async function scrapeMultiPages() {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  const results = [];
  let currentPage = 1;
  const maxPages = 5; // Maximum number of pages to scrape

  await page.goto('https://example.com/list?page=1');

  while (currentPage <= maxPages) {
    // Extract data from the current page.
    const data = await page.evaluate(() =>
      Array.from(document.querySelectorAll('.item')).map(el => ({
        title: el.querySelector('h3')?.textContent || '',
        link: el.querySelector('a')?.href || ''
      }))
    );
    results.push(...data);
    console.log(`Fetched page ${currentPage}: ${data.length} items.`);

    // Move on to the next page.
    currentPage++;
    try {
      await page.goto(`https://example.com/list?page=${currentPage}`);
    } catch {
      break;
    }
  }

  console.log(`Done! Scraped ${results.length} items in total.`);
  await browser.close();
}

scrapeMultiPages();
Calling API Interfaces
Many websites load paginated data through a backend API, which can be found in the Network tab of the browser's developer tools. Identify the endpoint that returns the paginated data, work out the pattern in its parameters (such as a page number or offset), and request each page in turn until an empty result is returned, ending the loop. The data is typically returned in JSON format.
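Here is a minimal sketch of this approach in Python; the /api/products endpoint, the page and size parameters, and the "data" field are placeholders for whatever the Network tab actually reveals.
Python
import requests

api_url = "https://example.com/api/products"  # hypothetical endpoint found in the Network tab
page = 1
all_items = []

while True:
    response = requests.get(api_url, params={"page": page, "size": 20})
    response.raise_for_status()
    items = response.json().get("data", [])  # assumes results are returned under "data"
    if not items:  # an empty array means we have gone past the last page
        break
    all_items.extend(items)
    page += 1

print(f"Collected {len(all_items)} items across {page - 1} pages.")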
The methods above can solve most multi-page scraping problems. Additionally, it is important to simulate human behavior by adding random delays between requests to avoid repeated high-frequency hits, and to implement retry mechanisms and error handling so the scraping runs smoothly.
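As a closing sketch, the helper below shows one way to combine random delays with a simple retry loop; the 1 to 3 second delay range and the three retries are arbitrary choices rather than fixed rules.
Python
import random
import time

import requests

def polite_get(url, max_retries=3):
    """Fetch a URL with a random delay and a basic retry loop."""
    for attempt in range(1, max_retries + 1):
        time.sleep(random.uniform(1, 3))  # random pause to mimic human browsing
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            print(f"Attempt {attempt} failed: {e}")
    return None  # give up after max_retries attempts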