This document addresses an issue where Google Maps scraping appears to succeed (data is fetched), but the output file that should store the data is never updated. We will analyze the root causes and provide solutions.
1. Improper Handling of Asynchronous Operations
What are Asynchronous Operations?
Puppeteer relies heavily on asynchronous operations: after a request is issued, the program can continue executing other tasks instead of blocking until that request finishes. In Node.js this concurrency comes from a single-threaded event loop rather than from separate threads, so if your code does not wait for the scraping Promises to resolve, the file-saving step can run before any data has actually been collected.
Advantages of Asynchronous Operations:
Improve application speed and throughput, which benefits both overall performance and user experience.
Increase application scalability by freeing up resources instead of tying them up in blocked operations.
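To see why this matters for file updates, consider a common mistake (a hypothetical sketch, not your exact script): iterating with forEach, whose async callbacks are not awaited, so the file is written before any page has been scraped.
javascript
const fs = require('fs');

async function scrapeAll(page, urls) {
  const allData = [];
  // BUG: forEach does not wait for its async callback,
  // so this loop returns immediately with nothing scraped yet.
  urls.forEach(async (url) => {
    await page.goto(url);
    allData.push(await page.evaluate(() => { /* ... */ }));
  });
  // Runs right away with an empty array: results.json never receives the data.
  fs.writeFileSync('results.json', JSON.stringify(allData, null, 2));
}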
How to Fix This Issue?
First, ensure you use await before all Puppeteer and Node.js functions that return a Promise.
Second, use a for...of loop in combination with await (rather than forEach, which does not wait for async callbacks) to guarantee correct execution order.
Code Example:
javascript
const puppeteer = require('puppeteer');
const fs = require('fs');

async function main() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const urls = ['url1', 'url2'];
  const allData = [];
  for (const url of urls) {
    await page.goto(url);                                  // wait for navigation to finish
    const data = await page.evaluate(() => { /* ... */ }); // wait for extraction to finish
    allData.push(data);
  }
  fs.writeFileSync('results.json', JSON.stringify(allData, null, 2)); // runs only after the loop completes
  await browser.close();
}

main();
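Because the for...of loop awaits each iteration, fs.writeFileSync only runs after every URL has been processed, which is what guarantees the saved file actually contains the scraped data.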
2. Selector Issues
Selectors can fail. Google Maps updates its page structure regularly, which can invalidate previously working selectors; a selector that no longer matches returns null, leaving your extracted values undefined.
Google Maps pages are complex, with dynamically loaded content and class names that can be confusing or change.
Solutions:
Verify Selectors Manually in the Page:
Add a pause in your script (e.g., await new Promise(resolve => setTimeout(resolve, 90000))) to keep the browser open longer. Open the Developer Tools in the Puppeteer-controlled browser and test your selector in the console using document.querySelector('your-selector') to verify its accuracy.
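A minimal sketch of this debugging setup, assuming a hypothetical search URL (headless: false and the long pause are debugging aids only, not part of the final script):
javascript
const puppeteer = require('puppeteer');

(async () => {
  // Launch a visible browser with DevTools open so selectors can be tested in the console
  const browser = await puppeteer.launch({ headless: false, devtools: true });
  const page = await browser.newPage();
  await page.goto('https://www.google.com/maps/search/coffee'); // hypothetical example URL
  // Keep the page open for 90 seconds: run document.querySelector('your-selector') in the console
  await new Promise(resolve => setTimeout(resolve, 90000));
  await browser.close();
})();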
Use the page.waitForSelector Function:
Before interacting with or extracting data from an element, ensure it's fully loaded. Use page.waitForSelector to wait for the element to be present in the DOM, mitigating issues caused by dynamic content loading or network latency.
Code Example:
javascript
const fs = require('fs');

// Inside your async scraping function, after page.goto(...)
try {
  const resultsSelector = 'div[role="feed"]';
  // Wait up to 30 seconds for the results container to appear in the DOM
  await page.waitForSelector(resultsSelector, { timeout: 30000 });
  console.log('The results container has loaded, ready to fetch data...');
  const data = await page.evaluate(() => {
    // ... scraping logic ...
  });
  if (data && data.length > 0) {
    console.log(`Successfully captured ${data.length} items.`);
    fs.writeFileSync('results.json', JSON.stringify(data, null, 2));
    console.log('The file has been saved successfully!');
  } else {
    console.log('The captured data is empty, so the file was not updated. Please check the selector or page content.');
  }
} catch (error) {
  console.error('Capture failed or selector timed out:', error);
}
If individual values in the scraped data come back as null or undefined, it usually means the outer element was found, but a nested selector inside it did not match, so reading a property from the missing element failed or returned nothing.
Solution:
Use the optional chaining operator (?.) to safely access nested properties and provide fallback values.
Code Example:
javascript
const title = item.querySelector('div.fontHeadlineLarge')?.innerText || 'Title not found';
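In context, the same pattern might look like the following sketch inside page.evaluate (the div[role="feed"] container and the fontHeadlineLarge / fontBodyMedium class names are assumptions and may change as Google Maps updates its markup):
javascript
const data = await page.evaluate(() => {
  // Each child of the feed container is assumed to be one result card
  const items = Array.from(document.querySelectorAll('div[role="feed"] > div'));
  return items.map(item => ({
    // Optional chaining prevents a crash when a nested element is missing
    title: item.querySelector('div.fontHeadlineLarge')?.innerText || 'Title not found',
    rating: item.querySelector('span.fontBodyMedium')?.innerText || 'Rating not found',
  }));
});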