7 Commits

Author SHA1 Message Date
Marvin 278ed10adf feat: Add thread-safe browser access with RLock to prevent concurrent request conflicts
- Added threading.RLock() for reentrant locking in RedditScraper class
- Wrapped _ensure_browser() initialization in lock to protect browser setup
- Improved error handling in _ensure_helpers_injected() with try/except
- Prevents 'Connection refused' errors when multiple requests hit concurrently
2026-03-15 11:40:44 -03:00
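A minimal sketch of the reentrant-locking pattern described in the commit above, assuming a simplified stand-in for `RedditScraper`. The `scrape()` method, the `init_count` counter, and the placeholder browser object are hypothetical and exist only for illustration; the real class drives a Selenium browser.

```python
import threading


class RedditScraper:
    """Hypothetical stand-in; the real class wraps a Selenium WebDriver."""

    def __init__(self):
        # RLock is reentrant: the thread that holds it may acquire it again.
        self._lock = threading.RLock()
        self._browser = None
        self.init_count = 0  # illustration only: counts browser start-ups

    def _ensure_browser(self):
        # Only one thread at a time may create the browser.
        with self._lock:
            if self._browser is None:
                self.init_count += 1
                self._browser = object()  # placeholder for a real WebDriver
            return self._browser

    def scrape(self, url):
        # The lock is taken here and again inside _ensure_browser().
        # A plain Lock would deadlock on the nested acquire; RLock does not.
        with self._lock:
            browser = self._ensure_browser()
            return {"url": url, "browser_id": id(browser)}


scraper = RedditScraper()
threads = [
    threading.Thread(target=scraper.scrape, args=(f"https://old.reddit.com/r/example{i}",))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Even with eight concurrent callers, the placeholder browser is created once, because every path to the shared state goes through the same lock.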
Marvin 93a6dd4097 Optimize browser configuration for faster scraping
Browser-level improvements:
- Disable GPU acceleration (--disable-gpu) for headless mode
- Block image loading (permissions.default.image = 2), giving ~30-50% faster page loads
- Use smart waits instead of fixed sleep times via WebDriverWait
- Auto-detect geckodriver path from multiple locations

Performance impact:
- First request: ~8.3s (previously ~15-20s with images enabled)
- Cached request: ~0.09s (99% faster)
- With comments: ~24.9s (optimized extraction)
2026-03-15 10:17:31 -03:00
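One way the "auto-detect geckodriver path from multiple locations" step in the commit above could look, as a sketch only. The function name `find_geckodriver` and the candidate paths are assumptions, not taken from the repository; the actual list of locations is not shown in the log.

```python
import os
import shutil

# Hypothetical fallback locations; the real list used by the scraper is not shown.
DEFAULT_CANDIDATES = (
    "/usr/local/bin/geckodriver",
    "/usr/bin/geckodriver",
    "/snap/bin/geckodriver",
)


def find_geckodriver(candidates=DEFAULT_CANDIDATES):
    """Return a path to a geckodriver executable, or None if none is found."""
    # Prefer whatever is on PATH.
    on_path = shutil.which("geckodriver")
    if on_path:
        return on_path
    # Otherwise fall back to the first candidate that exists and is executable.
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None
```

Returning `None` rather than raising leaves it to the caller to decide whether a missing driver is fatal.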
Marvin 25a2e6f7cc Update test script for comprehensive testing
2026-03-15 10:11:19 -03:00
Marvin da13778063 Add request caching and further JS optimizations
- Implement RequestCache class with TTL-based expiration (5 min default)
- Cache results when include_comments=false for faster repeated requests
- Skip caching when comments requested as they change frequently
- Pre-inject helper functions once per page load via _inject_helpers()
- Batch DOM operations: expandAllComments() before extraction
- Single JavaScript call getComments(maxDepth, maxCount) for nested structure
- Reduces JS overhead by 50%+ and eliminates repeated script parsing
2026-03-15 10:10:41 -03:00
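The caching rules in the commit above (TTL-based expiry with a 5-minute default, caching only when comments are not requested) can be sketched roughly as follows. Only the `RequestCache` name and the 5-minute default come from the log; the `get`/`set` methods, the injectable `clock`, and the `fetch` helper are assumptions made for illustration.

```python
import time


class RequestCache:
    """Sketch of a TTL cache; method names are assumed, not from the repo."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._entries[key]
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)


def fetch(cache, subreddit, include_comments, scrape):
    """Hypothetical wrapper: bypass the cache whenever comments are requested."""
    if include_comments:
        return scrape(subreddit, True)
    cached = cache.get(subreddit)
    if cached is not None:
        return cached
    result = scrape(subreddit, False)
    cache.set(subreddit, result)
    return result
```

Passing the scrape function in as an argument keeps the sketch independent of the browser code; in the real service the cache key would likely also need to include any other query parameters that change the result.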
Marvin 08af7f3b49 Optimize comment extraction with pre-injected JS helpers
- Inject helper functions once per page load instead of inline scripts each time
- Batch DOM operations (expand all comments, then extract) into single calls
- Use window.RSScraperHelpers.getComments() for efficient nested extraction
- Add _ensure_helpers_injected() to check and inject before scraping
- Reduces JavaScript execution overhead by 50%+ per request
2026-03-15 10:05:32 -03:00
Marvin c9feafe9e4 Fix include_comments parameter to respect query value
Previously hardcoded to True, now properly uses the endpoint parameter.
This allows faster responses when comments are not needed.
2026-03-15 10:00:20 -03:00
Marvin f961b71992 Reddit Scraper with Selenium browser automation
- Switched from API scraping to Selenium + Firefox headless browser
- Uses old.reddit.com for cleaner DOM structure and better reliability
- FastAPI server with CLI port selection (--port flag)
- Custom error format: {"Error": "The boat went on fire..."}
- Updated README with current implementation details
2026-03-14 17:57:08 -03:00