Playwright Crawling Pipeline: Production Code
Theory is over. Now we build.
The Structure
Collect → Filter → Store → Alert → Repeat
Five steps in code.
Step 1: Basic Collection
Price Monitoring Example
const { chromium } = require('playwright');
async function checkPrice() {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto('https://example.com/product/12345');
    // locator() waits for the element before reading its text
    const price = await page.locator('.price').textContent();
    // Keep digits only, e.g. "50,000" -> 50000 (assumes whole-number prices)
    return parseInt(price.replace(/[^0-9]/g, ''), 10);
  } finally {
    // Runs even if navigation or the locator throws
    await browser.close();
  }
}
Key Points:
- headless: true → Background execution
- locator() → Auto-wait included
- Always close browser
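Stripping every non-digit works for whole-number prices, but it turns "49.99" into 4999 and throws if textContent() returns null. Here is a minimal sketch of a more defensive parser. The name parsePrice is hypothetical, and it assumes commas are thousands separators and a dot is the decimal mark.

```javascript
// Hypothetical helper: convert scraped price text to a number, or null if none is found.
// Assumes "1,299.99"-style formatting (comma = thousands, dot = decimal).
function parsePrice(text) {
  if (typeof text !== 'string') return null;        // textContent() can return null
  const match = text.match(/\d[\d,]*(\.\d+)?/);     // first run of digits, commas, optional decimals
  if (!match) return null;
  const value = Number(match[0].replace(/,/g, ''));
  return Number.isFinite(value) ? value : null;
}
```

Returning null instead of NaN makes the "no price found" case explicit, so the caller can skip the check or log it.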
Step 2: Filtering Logic
const TARGET_PRICE = 50000;
async function monitorPrice() {
  const currentPrice = await checkPrice();
  if (currentPrice < TARGET_PRICE) {
    console.log(`🚨 Price drop! ${currentPrice}`);
    // Await the alert so a failed send surfaces as an error
    await sendAlert(currentPrice);
  }
}
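A plain threshold check fires on every run while the price stays low, which means an alert every 30 minutes. One possible refinement, sketched here, is to alert only when the price crosses below the target. The function name shouldAlert and the idea of passing in the previous reading are assumptions, not part of the original pipeline.

```javascript
// Hypothetical filter: alert only on the transition from "at or above target" to "below target".
function shouldAlert(currentPrice, previousPrice, targetPrice) {
  if (!Number.isFinite(currentPrice)) return false;   // ignore failed or unparseable readings
  const isBelow = currentPrice < targetPrice;
  // No previous reading (null/undefined) counts as "was not below"
  const wasBelow = Number.isFinite(previousPrice) && previousPrice < targetPrice;
  return isBelow && !wasBelow;
}
```

The previous reading could come from the most recent row saved in the storage step.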
Step 3: Data Storage
SQLite (Recommended)
const sqlite3 = require('sqlite3');
const db = new sqlite3.Database('prices.db');
// Create the table once at startup
db.run(`
  CREATE TABLE IF NOT EXISTS prices (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id TEXT,
    price INTEGER,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
  )
`);
function saveToDb(productId, price) {
  db.run(
    'INSERT INTO prices (product_id, price) VALUES (?, ?)',
    [productId, price],
    // Report insert failures instead of letting them go unnoticed
    (err) => { if (err) console.error('DB insert failed:', err.message); }
  );
}
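The sqlite3 package is callback-based, while the rest of the pipeline uses async/await. A minimal sketch of a promise wrapper follows. The name saveToDbAsync is hypothetical, and the db handle is passed in as a parameter so the function is easy to test.

```javascript
// Hypothetical wrapper: resolves with the new row ID, rejects on insert failure.
// `db` is expected to be a sqlite3 Database instance.
function saveToDbAsync(db, productId, price) {
  return new Promise((resolve, reject) => {
    db.run(
      'INSERT INTO prices (product_id, price) VALUES (?, ?)',
      [productId, price],
      function (err) {              // regular function: sqlite3 exposes lastID on `this`
        if (err) reject(err);
        else resolve(this.lastID);
      }
    );
  });
}
```

With this, the monitor can `await saveToDbAsync(db, '12345', price)` and handle a failed write like any other error.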
Step 4: Alerts
Telegram Bot
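const axios = require('axios');
// Read credentials from the environment instead of hard-coding them
// (the variable names TELEGRAM_TOKEN and TELEGRAM_CHAT_ID are assumptions)
const TOKEN = process.env.TELEGRAM_TOKEN;
const CHAT_ID = process.env.TELEGRAM_CHAT_ID;
async function sendTelegramAlert(message) {
  const url = `https://api.telegram.org/bot${TOKEN}/sendMessage`;
  await axios.post(url, {
    chat_id: CHAT_ID,
    text: message
  });
}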
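A missing token or chat ID produces a confusing HTTP error from Telegram. One option is to validate the configuration and build the request separately from sending it. The sketch below does that. The function name buildTelegramRequest is hypothetical, and the truncation reflects Telegram's 4096-character limit on message text.

```javascript
const TELEGRAM_MAX_LENGTH = 4096; // Telegram's limit for message text, in characters

// Hypothetical helper: returns the URL and JSON body for a sendMessage call.
function buildTelegramRequest(token, chatId, message) {
  if (!token || !chatId) {
    throw new Error('Missing Telegram token or chat ID');
  }
  const text = String(message);
  return {
    url: `https://api.telegram.org/bot${token}/sendMessage`,
    body: {
      chat_id: chatId,
      // Truncate long messages and mark the cut with an ellipsis
      text: text.length > TELEGRAM_MAX_LENGTH
        ? text.slice(0, TELEGRAM_MAX_LENGTH - 1) + '…'
        : text,
    },
  };
}
```

The result plugs straight into the existing call: `const { url, body } = buildTelegramRequest(token, chatId, message); await axios.post(url, body);`.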
Step 5: Scheduling
node-cron
const cron = require('node-cron');
// Every 30 minutes
cron.schedule('*/30 * * * *', async () => {
  await monitorPrice();
});
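If a check ever takes longer than the interval (a slow page, a hung browser), the scheduler starts a second run on top of the first. Here is a minimal sketch of a guard that skips a tick while the previous run is still in progress. The name skipIfRunning is hypothetical.

```javascript
// Hypothetical guard: wraps an async task so overlapping invocations are skipped.
function skipIfRunning(task) {
  let running = false;
  return async function guarded(...args) {
    if (running) return false;      // previous run still in progress: skip this tick
    running = true;
    try {
      await task(...args);
      return true;                  // the task ran to completion
    } finally {
      running = false;              // release the guard even if the task throws
    }
  };
}
```

It would be used as `cron.schedule('*/30 * * * *', skipIfRunning(monitorPrice))`.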
Complete Integration
const { chromium } = require('playwright');
const cron = require('node-cron');
const axios = require('axios');
const TARGET_PRICE = 50000;
// Telegram credentials come from the environment (variable names are assumptions)
const TOKEN = process.env.TELEGRAM_TOKEN;
const CHAT_ID = process.env.TELEGRAM_CHAT_ID;
async function checkPrice() {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto('https://example.com/product/12345');
    const price = await page.locator('.price').textContent();
    return parseInt(price.replace(/[^0-9]/g, ''), 10);
  } finally {
    // Close the browser even when navigation or the locator fails
    await browser.close();
  }
}
async function sendAlert(price) {
  const url = `https://api.telegram.org/bot${TOKEN}/sendMessage`;
  await axios.post(url, {
    chat_id: CHAT_ID,
    text: `🚨 Price: ${price}`
  });
}
async function monitor() {
  const price = await checkPrice();
  if (price < TARGET_PRICE) await sendAlert(price);
}
// Every 30 minutes
cron.schedule('*/30 * * * *', monitor);
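The integrated script covers Collect, Filter, Alert, and Repeat, but leaves out the Store step. One way to wire in all of the steps is to pass each one as a function, which also makes the flow testable without a browser or a bot. This is only a sketch: runPipeline and its parameter names are hypothetical.

```javascript
// Hypothetical orchestrator: each step is injected, so real or fake implementations both work.
async function runPipeline({ collect, store, alert, targetPrice }) {
  const price = await collect();                        // Collect
  if (!Number.isFinite(price)) {
    throw new Error(`Collected price is not a number: ${price}`);
  }
  const belowTarget = price < targetPrice;              // Filter
  await store(price);                                   // Store every reading, not only the drops
  if (belowTarget) await alert(price);                  // Alert
  return { price, alerted: belowTarget };
}
```

In the script above, `collect` would be checkPrice, `alert` would be sendAlert, and `store` would be a function that writes to the prices table.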
Deployment
1) Local Server (24/7)
npm install -g pm2
pm2 start monitor.js --name price-monitor
pm2 startup
pm2 save
2) Cloud (AWS EC2)
# Install Node.js
curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
sudo apt-get install -y nodejs
# Install Playwright dependencies
npx playwright install-deps
# Run
pm2 start monitor.js
3) GitHub Actions (Free)
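name: Price Monitor
on:
  schedule:
    # Every hour, on the hour (UTC)
    - cron: '0 */1 * * *'
jobs:
  monitor:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: npm install
      - run: npx playwright install chromium
      # Here the workflow is the scheduler, so monitor.js must do one check and exit.
      # A script that calls cron.schedule() keeps running until the job times out.
      - run: node monitor.js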
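On a server, the script schedules itself. On GitHub Actions, the workflow does the scheduling and the script should run once. One possible way to support both from a single file is to pick the mode at startup, as sketched here. The `--once` flag and the name resolveRunMode are hypothetical. The check on `CI` relies on GitHub Actions setting `CI=true` in the job environment.

```javascript
// Hypothetical mode switch: 'once' for CI or an explicit flag, 'schedule' otherwise.
function resolveRunMode(env = process.env, argv = process.argv) {
  if (argv.includes('--once')) return 'once';   // explicit opt-in from the command line
  if (env.CI === 'true') return 'once';         // GitHub Actions sets CI=true
  return 'schedule';
}
```

At the bottom of monitor.js, the result decides between calling `monitor()` a single time and calling `cron.schedule(...)`.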
Cost Structure
| Method | Cost |
|---|---|
| Local Server | Electricity only |
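| AWS EC2 (t2.micro) | Free for the first year on free-tier-eligible accounts, then a small hourly charge |
| GitHub Actions | Free for public repos; private repos get a limited monthly quota of minutes |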
Performance Optimization
1) Context Reuse
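// ✅ One browser, multiple contexts
// Launch once; top-level await is not available in CommonJS, so keep the promise
const browserPromise = chromium.launch();
async function optimized() {
  const browser = await browserPromise;
  const context = await browser.newContext();
  try {
    const page = await context.newPage();
    // ... work
  } finally {
    // Close the context even if the work above throws
    await context.close();
  }
}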
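Result: roughly 70% less memory than launching a new browser per check. Treat the figure as an estimate and measure your own workload.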
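Context reuse only pays off if every context is closed, including when a check fails partway through. A small helper can make that automatic. The sketch below assumes an already-launched Playwright browser is passed in, and the name withContext is hypothetical.

```javascript
// Hypothetical helper: run `work(page)` in a fresh context and always clean up.
// `browser` is expected to be a Playwright Browser instance.
async function withContext(browser, work) {
  const context = await browser.newContext();
  try {
    const page = await context.newPage();
    return await work(page);
  } finally {
    await context.close();      // closing the context also closes its pages
  }
}
```

A check then becomes `await withContext(browser, page => page.goto(url))`, with no cleanup code at the call site.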
2) Parallel Processing
const products = ['12345', '67890', '11111'];
// Assumes checkPrice is changed to accept a product ID
await Promise.all(
  products.map(id => checkPrice(id))
);
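Promise.all starts every check at once and rejects as soon as one fails. With a long product list, that means many pages open together and one bad product hides the rest. Here is a minimal sketch of a concurrency-limited alternative. The name mapWithLimit is hypothetical, and the result shape mirrors Promise.allSettled.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit` in flight.
// Resolves to an array of { status, value } or { status, reason }, in input order.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;                                   // index of the next unclaimed item
  async function runner() {
    while (next < items.length) {
      const index = next++;                       // claim an item (safe: JS is single-threaded)
      try {
        results[index] = { status: 'fulfilled', value: await worker(items[index]) };
      } catch (reason) {
        results[index] = { status: 'rejected', reason };
      }
    }
  }
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

For the product list above, `await mapWithLimit(products, 2, id => checkPrice(id))` would check two products at a time.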
Error Handling
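async function safeMonitor() {
  const MAX_RETRIES = 3;
  for (let i = 0; i < MAX_RETRIES; i++) {
    try {
      return await monitor();
    } catch (error) {
      console.error(`Attempt ${i + 1} failed:`, error.message);
      if (i === MAX_RETRIES - 1) {
        // Final attempt failed: notify and stop instead of waiting again
        await sendAlert('⚠️ System error');
        return;
      }
      // Wait 60 seconds before the next attempt
      await new Promise(r => setTimeout(r, 60000));
    }
  }
}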
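A fixed 60-second wait is simple, but a common variation is exponential backoff: retry quickly at first, then wait longer each time. Here is a sketch of a reusable version. The names retry and backoffDelay and the default values are assumptions.

```javascript
// Hypothetical helper: delay before retry number `attempt` (0-based), doubling each time.
function backoffDelay(attempt, baseMs = 1000, maxMs = 60000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Hypothetical helper: run `task` up to `retries` times, then rethrow the last error.
async function retry(task, { retries = 3, baseMs = 1000, maxMs = 60000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt < retries - 1) {                // no wait after the final attempt
        await new Promise(r => setTimeout(r, backoffDelay(attempt, baseMs, maxMs)));
      }
    }
  }
  throw lastError;
}
```

Because the last error is rethrown, the caller decides what a final failure means, for example sending the "System error" alert.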
Wrong vs. Right
❌ Wrong
- New browser every time
- No error handling
- Manual execution
- No logging
✅ Right
- Context reuse
- 3 retries + alerts
- Cron automation
- Complete logging
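"Complete logging" does not need a library to get started. One lightweight approach is to write one JSON object per line, which pm2 and most log tools can capture and search. The sketch below shows the idea. The name logEvent and the field names are assumptions.

```javascript
// Hypothetical logger: prints one JSON line per event and returns it.
function logEvent(level, event, data = {}) {
  const line = JSON.stringify({
    time: new Date().toISOString(),   // UTC timestamp
    level,                            // e.g. 'info' or 'error'
    event,                            // short event name, e.g. 'price_checked'
    ...data,                          // any extra fields for this event
  });
  (level === 'error' ? console.error : console.log)(line);
  return line;
}
```

Typical calls would be `logEvent('info', 'price_checked', { productId: '12345', price: 49000 })` after each check and `logEvent('error', 'check_failed', { message: error.message })` in the retry handler.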
Pipeline built. Now just run it.
Operational insights coming in the next post.