Playwright Crawling Pipeline: Production Code

Theory is over. Now we build.


The Structure

Collect → Filter → Store → Alert → Repeat

Five steps in code.


Step 1: Basic Collection

Price Monitoring Example

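const { chromium } = require('playwright');

async function checkPrice() {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto('https://example.com/product/12345');

    // textContent() waits for .price to appear, then returns its text
    const price = await page.locator('.price').textContent();

    // Strip everything except digits ("₩49,900" → 49900); assumes whole-number prices
    return parseInt((price ?? '').replace(/[^0-9]/g, ''), 10);
  } finally {
    // Runs even if navigation or the locator throws
    await browser.close();
  }
}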

Key Points:

  • headless: true → Runs without a visible browser window
  • .locator() → Actions such as textContent() wait for the element automatically
  • Always close the browser, even when a step fails
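
Parsing deserves its own function. A digits-only regex turns "$1,299.99" into 129999, so here is a minimal sketch of a stricter parser. parsePrice is a hypothetical helper, and it assumes commas are thousands separators and the dot is the decimal mark.

```javascript
// Hypothetical helper: converts scraped price text to a number.
// Assumption: "," separates thousands and "." marks decimals.
function parsePrice(text) {
  if (typeof text !== 'string') {
    throw new Error('Price element had no text');
  }
  // Keep digits and the decimal point; drop currency symbols, commas, spaces
  const cleaned = text.replace(/[^0-9.]/g, '');
  const value = Number.parseFloat(cleaned);
  if (Number.isNaN(value)) {
    // e.g. "Sold out": fail loudly instead of comparing NaN to a target
    throw new Error(`Could not parse a price from "${text}"`);
  }
  return value;
}
```

Throwing on unparseable text means a layout change on the site shows up as an error, not as a silently skipped alert.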

Step 2: Filtering Logic

const TARGET_PRICE = 50000;

async function monitorPrice() {
  const currentPrice = await checkPrice();

  if (currentPrice < TARGET_PRICE) {
    console.log(`🚨 Price drop! ${currentPrice}`);
    // sendTelegramAlert() is defined in Step 4
    await sendTelegramAlert(`🚨 Price drop! ${currentPrice}`);
  }
}
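
A plain threshold check fires on every run while the price stays low. One possible refinement, sketched here with hypothetical names (shouldAlert, createPriceFilter), is to alert only when the price crosses below the target. It keeps the previous price in memory, so the state resets on restart.

```javascript
// Alert only on the transition from "at or above target" to "below target".
function shouldAlert(currentPrice, previousPrice, targetPrice) {
  const isBelow = currentPrice < targetPrice;
  const wasBelow = previousPrice !== null && previousPrice < targetPrice;
  return isBelow && !wasBelow;
}

// Returns a function that remembers the last price it saw.
function createPriceFilter(targetPrice) {
  let previousPrice = null; // null until the first check
  return function check(currentPrice) {
    const alert = shouldAlert(currentPrice, previousPrice, targetPrice);
    previousPrice = currentPrice;
    return alert;
  };
}
```

With a target of 50000, the sequence 52000, 49000, 48000 produces a single alert, at 49000.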

Step 3: Data Storage

SQLite (Recommended)

const sqlite3 = require('sqlite3');
const db = new sqlite3.Database('prices.db');

db.run(`
  CREATE TABLE IF NOT EXISTS prices (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id TEXT,
    price INTEGER,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
  )
`);

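function saveToDb(productId, price) {
  db.run(
    'INSERT INTO prices (product_id, price) VALUES (?, ?)',
    [productId, price],
    (err) => {
      // Log a failed insert so it can be diagnosed
      if (err) console.error('Failed to save price:', err.message);
    }
  );
}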
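
db.run() is callback-based, which sits awkwardly next to the async/await code in the other steps. A minimal sketch of a promise wrapper follows; runAsync and savePrice are hypothetical names, and db is assumed to be a sqlite3 Database like the one created above.

```javascript
// Wraps sqlite3's callback-style db.run() in a promise.
function runAsync(db, sql, params = []) {
  return new Promise((resolve, reject) => {
    // sqlite3 binds `this` to the statement in the callback, so this
    // must be a regular function, not an arrow function.
    db.run(sql, params, function onDone(err) {
      if (err) reject(err);
      else resolve({ lastID: this.lastID, changes: this.changes });
    });
  });
}

// Inserts one price row and resolves with the new row id.
async function savePrice(db, productId, price) {
  const { lastID } = await runAsync(
    db,
    'INSERT INTO prices (product_id, price) VALUES (?, ?)',
    [productId, price]
  );
  return lastID;
}
```

Now a failed insert rejects the promise, so the caller's try/catch or retry logic sees it.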

Step 4: Alerts

Telegram Bot

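const axios = require('axios');

// Read credentials from the environment rather than hard-coding them
// (the variable names are an assumption)
const TOKEN = process.env.TELEGRAM_TOKEN;
const CHAT_ID = process.env.TELEGRAM_CHAT_ID;

async function sendTelegramAlert(message) {
  const url = `https://api.telegram.org/bot${TOKEN}/sendMessage`;
  await axios.post(url, {
    chat_id: CHAT_ID,
    text: message
  });
}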
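
A missing token produces a confusing 404 from Telegram. The sketch below validates the credentials first and separates building the request from sending it. It swaps axios for the fetch built into Node 18+ to stay dependency-free; buildTelegramRequest, sendTelegramAlertFromEnv, and the TELEGRAM_TOKEN / TELEGRAM_CHAT_ID variable names are assumptions.

```javascript
// Builds the sendMessage request; throws early if credentials are missing.
function buildTelegramRequest(token, chatId, text) {
  if (!token || !chatId) {
    throw new Error('Set TELEGRAM_TOKEN and TELEGRAM_CHAT_ID before sending alerts');
  }
  return {
    url: `https://api.telegram.org/bot${token}/sendMessage`,
    body: { chat_id: chatId, text },
  };
}

// Sends the alert using the global fetch available in Node 18+.
async function sendTelegramAlertFromEnv(text, env = process.env) {
  const { url, body } = buildTelegramRequest(
    env.TELEGRAM_TOKEN,
    env.TELEGRAM_CHAT_ID,
    text
  );
  const response = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  if (!response.ok) {
    throw new Error(`Telegram API responded with HTTP ${response.status}`);
  }
}
```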

Step 5: Scheduling

node-cron

const cron = require('node-cron');

// Every 30 minutes
cron.schedule('*/30 * * * *', async () => {
  await monitorPrice();
});
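
If a check hangs past the next tick, two runs overlap and two browsers are open at once. A minimal sketch of a guard that skips a tick while the previous run is still going; skipIfRunning is a hypothetical helper, not part of node-cron.

```javascript
// Wraps an async task so that a call made while a previous call is still
// running returns immediately instead of starting a second run.
function skipIfRunning(task) {
  let running = false;
  return async function guarded(...args) {
    if (running) return { skipped: true };
    running = true;
    try {
      return { skipped: false, result: await task(...args) };
    } finally {
      running = false; // released on success and on failure
    }
  };
}

// Usage with the scheduler above:
// cron.schedule('*/30 * * * *', skipIfRunning(monitorPrice));
```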

Complete Integration

const { chromium } = require('playwright');
const cron = require('node-cron');
const axios = require('axios');

const TARGET_PRICE = 50000;
// Telegram credentials come from the environment (variable names are an assumption)
const TOKEN = process.env.TELEGRAM_TOKEN;
const CHAT_ID = process.env.TELEGRAM_CHAT_ID;

async function checkPrice() {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto('https://example.com/product/12345');
    const price = await page.locator('.price').textContent();
    return parseInt((price ?? '').replace(/[^0-9]/g, ''), 10);
  } finally {
    await browser.close(); // close even if the page fails to load
  }
}

async function sendAlert(price) {
  const url = `https://api.telegram.org/bot${TOKEN}/sendMessage`;
  await axios.post(url, {
    chat_id: CHAT_ID,
    text: `🚨 Price: ${price}`
  });
}

async function monitor() {
  const price = await checkPrice();
  if (price < TARGET_PRICE) await sendAlert(price);
}

// Catch failures so a rejected run is logged instead of going unhandled
cron.schedule('*/30 * * * *', () => monitor().catch(console.error));

Deployment

1) Local Server (24/7)

npm install -g pm2
pm2 start monitor.js --name price-monitor
pm2 startup
pm2 save
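
The same settings can live in a file instead of command-line flags. A hedged sketch of a PM2 ecosystem file; the memory limit and environment values are assumptions to adjust.

```javascript
// ecosystem.config.js: a hypothetical PM2 config for the monitor script
const config = {
  apps: [
    {
      name: 'price-monitor',
      script: 'monitor.js',
      autorestart: true,          // restart the process if it crashes
      max_memory_restart: '300M', // restart if memory passes this (assumed value)
      env: {
        NODE_ENV: 'production',
      },
    },
  ],
};

module.exports = config;
```

Start it with pm2 start ecosystem.config.js in place of the pm2 start line above.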

2) Cloud (AWS EC2)

# Install Node.js
curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
sudo apt-get install -y nodejs

# Install project packages and PM2
npm install
sudo npm install -g pm2

# Install Chromium and its system dependencies
npx playwright install --with-deps chromium

# Run
pm2 start monitor.js

3) GitHub Actions (Free)

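name: Price Monitor
on:
  schedule:
    - cron: '0 */1 * * *'   # every hour, on the hour (UTC)
jobs:
  monitor:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: npm install
      - run: npx playwright install --with-deps chromium
      # The script must run one check and exit. A script that calls
      # cron.schedule() never exits, so the job would run until it times out.
      - run: node monitor.js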
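
Here the workflow schedule replaces node-cron, so the script only needs to run one check and report success or failure through its exit code. A minimal sketch, with runOnce as a hypothetical helper:

```javascript
// Runs a single check and maps the outcome to a process exit code.
async function runOnce(task, log = console.error) {
  try {
    await task();
    return 0; // success: the workflow step passes
  } catch (error) {
    log(`Check failed: ${error.message}`);
    return 1; // failure: the workflow step is marked as failed
  }
}

// In a one-shot script (monitor is the function from the integration code):
// runOnce(monitor).then((code) => { process.exitCode = code; });
```

A non-zero exit code makes the failed run visible in the Actions tab.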

Cost Structure

Method              | Cost
Local Server        | Electricity only
AWS EC2 (t2.micro)  | Free (1 year), then minimal
GitHub Actions      | Free (public repos)

Performance Optimization

1) Context Reuse

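const { chromium } = require('playwright');

// ✅ One browser, multiple contexts
let browserPromise;

async function optimized() {
  // Launch once on first use; later calls reuse the same browser
  browserPromise ??= chromium.launch();
  const browser = await browserPromise;

  const context = await browser.newContext();
  try {
    const page = await context.newPage();
    // ... work
  } finally {
    await context.close();
  }
}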

Result: about 70% less memory than launching a new browser for every check
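
The reuse pattern can be packaged so callers never touch the browser directly. A sketch under stated assumptions: createContextRunner is a hypothetical helper, and launchBrowser is any function returning a Playwright browser, injected so the logic can be exercised without a real one.

```javascript
// One shared browser, one fresh context per unit of work.
function createContextRunner(launchBrowser) {
  let browserPromise = null;

  async function withPage(work) {
    // Launch lazily, and only once, even if several calls arrive together
    if (!browserPromise) browserPromise = launchBrowser();
    const browser = await browserPromise;
    const context = await browser.newContext();
    try {
      const page = await context.newPage();
      return await work(page);
    } finally {
      await context.close(); // closing the context also closes its pages
    }
  }

  async function shutdown() {
    if (!browserPromise) return;
    const browser = await browserPromise;
    browserPromise = null;
    await browser.close();
  }

  return { withPage, shutdown };
}

// With Playwright installed:
// const { chromium } = require('playwright');
// const runner = createContextRunner(() => chromium.launch({ headless: true }));
// const text = await runner.withPage((page) => page.title());
```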

2) Parallel Processing

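const products = ['12345', '67890', '11111'];

// Assumes checkPrice(id) takes a product ID and builds the URL from it;
// the Step 1 version hard-codes a single product.
async function checkAll() {
  return Promise.all(
    products.map(id => checkPrice(id))
  );
}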
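
Promise.all starts every check at once, which is fine for three products but heavy for three hundred. A minimal sketch of a concurrency cap; mapWithConcurrency is a hypothetical helper, and the limit of 2 in the usage line is arbitrary.

```javascript
// Runs worker(item) for every item, with at most `limit` running at a time.
// Results keep the same order as the input array.
async function mapWithConcurrency(items, limit, worker) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  const results = new Array(items.length);
  let nextIndex = 0;

  // Each runner pulls the next unclaimed index until none are left
  async function runWorker() {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      results[index] = await worker(items[index], index);
    }
  }

  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => runWorker()));
  return results;
}

// Usage: const prices = await mapWithConcurrency(products, 2, checkPrice);
```

As with Promise.all, the first rejected check rejects the whole call, so pair it with retries if one bad product should not stop the rest.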

Error Handling

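async function safeMonitor() {
  const MAX_RETRIES = 3;

  for (let i = 0; i < MAX_RETRIES; i++) {
    try {
      return await monitor();
    } catch (error) {
      console.error(`Attempt ${i + 1} failed:`, error.message);
      if (i === MAX_RETRIES - 1) {
        // Final attempt failed: notify and stop instead of waiting again
        await sendAlert('⚠️ System error');
        return;
      }
      // Wait one minute before the next attempt
      await new Promise(r => setTimeout(r, 60000));
    }
  }
}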
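
The retry loop can be pulled out into a reusable function. The sketch below uses exponential backoff where the post uses a fixed one-minute wait, a swapped-in technique; retry and its option names are hypothetical.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls task() up to `retries` times, doubling the wait after each failure.
// Rethrows the last error if every attempt fails.
async function retry(task, { retries = 3, baseDelayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        // 1x, 2x, 4x ... the base delay; no wait after the final attempt
        await sleep(baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  throw lastError;
}

// Usage: await retry(monitor, { retries: 3, baseDelayMs: 60000 });
```

Because the last error is rethrown, the caller decides whether to send a system alert, log, or exit.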

Wrong vs. Right

❌ Wrong

  • New browser every time
  • No error handling
  • Manual execution
  • No logging

✅ Right

  • Context reuse
  • 3 retries + alerts
  • Cron automation
  • Complete logging
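
Logging is the one item on the list without code so far. A minimal sketch of structured logging as one JSON object per line; formatLogLine, logEvent, and the field names are assumptions.

```javascript
// Builds one JSON log line. `now` is injectable so output is predictable in tests.
function formatLogLine(level, event, fields = {}, now = new Date()) {
  return JSON.stringify({ time: now.toISOString(), level, event, ...fields });
}

// Writes errors to stderr and everything else to stdout.
function logEvent(level, event, fields) {
  const line = formatLogLine(level, event, fields);
  (level === 'error' ? console.error : console.log)(line);
}

// Usage inside monitor():
// logEvent('info', 'price_checked', { productId: '12345', price: 49900 });
// logEvent('error', 'check_failed', { message: error.message });
```

PM2 and GitHub Actions both capture stdout and stderr, so no log file handling is needed.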

Pipeline built. Now just run it.

Operational insights coming in the next post.
