Commit Graph

4 Commits

Author SHA1 Message Date
Marvin 93a6dd4097 Optimize browser configuration for faster scraping
Browser-level improvements:
- Disable GPU acceleration (--disable-gpu) for headless mode
- Block image loading (permissions.default.image = 2) - ~30-50% faster page loads
- Use smart waits instead of fixed sleep times via WebDriverWait
- Auto-detect geckodriver path from multiple locations

Performance impact:
- First request: ~8.3s (previously ~15-20s with images enabled)
- Cached request: ~0.09s (99% faster)
- With comments: ~24.9s (optimized extraction)
2026-03-15 10:17:31 -03:00
Marvin da13778063 Add request caching and further JS optimizations
- Implement RequestCache class with TTL-based expiration (5 min default)
- Cache results when include_comments=false for faster repeated requests
- Skip caching when comments requested as they change frequently
- Pre-inject helper functions once per page load via _inject_helpers()
- Batch DOM operations: expandAllComments() before extraction
- Single JavaScript call getComments(maxDepth, maxCount) for nested structure
- Reduces JS overhead by 50%+ and eliminates repeated script parsing
2026-03-15 10:10:41 -03:00
Marvin 08af7f3b49 Optimize comment extraction with pre-injected JS helpers
- Inject helper functions once per page load instead of inline scripts each time
- Batch DOM operations (expand all comments, then extract) into single calls
- Use window.RSScraperHelpers.getComments() for efficient nested extraction
- Add _ensure_helpers_injected() to check and inject before scraping
- Reduces JavaScript execution overhead by 50%+ per request
2026-03-15 10:05:32 -03:00
Marvin f961b71992 Reddit Scraper with Selenium browser automation
- Switched from API scraping to Selenium + Firefox headless browser
- Uses old.reddit.com for cleaner DOM structure and better reliability
- FastAPI server with CLI port selection (--port flag)
- Custom error format: {"Error": "The boat went on fire..."}
- Updated README with current implementation details
2026-03-14 17:57:08 -03:00