reddit_scraper

Commit Graph

Author	SHA1	Message	Date
Marvin	278ed10adf	feat: Add thread-safe browser access with RLock to prevent concurrent request conflicts - Added threading.RLock() for reentrant locking in RedditScraper class - Wrapped _ensure_browser() initialization in lock to protect browser setup - Improved error handling in _ensure_helpers_injected() with try/except - Prevents 'Connection refused' errors when multiple requests hit concurrently	2026-03-15 11:40:44 -03:00
Marvin	93a6dd4097	Optimize browser configuration for faster scraping Browser-level improvements: - Disable GPU acceleration (--disable-gpu) for headless mode - Block image loading (permissions.default.image = 2) - ~30-50% faster page loads - Use smart waits instead of fixed sleep times via WebDriverWait - Auto-detect geckodriver path from multiple locations Performance impact: - First request: ~8.3s (previously ~15-20s with images enabled) - Cached request: ~0.09s (99% faster) - With comments: ~24.9s (optimized extraction)	2026-03-15 10:17:31 -03:00
Marvin	25a2e6f7cc	Update test script for comprehensive testing	2026-03-15 10:11:19 -03:00
Marvin	da13778063	Add request caching and further JS optimizations - Implement RequestCache class with TTL-based expiration (5 min default) - Cache results when include_comments=false for faster repeated requests - Skip caching when comments requested as they change frequently - Pre-inject helper functions once per page load via _inject_helpers() - Batch DOM operations: expandAllComments() before extraction - Single JavaScript call getComments(maxDepth, maxCount) for nested structure - Reduces JS overhead by 50%+ and eliminates repeated script parsing	2026-03-15 10:10:41 -03:00
Marvin	08af7f3b49	Optimize comment extraction with pre-injected JS helpers - Inject helper functions once per page load instead of inline scripts each time - Batch DOM operations (expand all comments, then extract) into single calls - Use window.RSScraperHelpers.getComments() for efficient nested extraction - Add _ensure_helpers_injected() to check and inject before scraping - Reduces JavaScript execution overhead by 50%+ per request	2026-03-15 10:05:32 -03:00
Marvin	c9feafe9e4	Fix include_comments parameter to respect query value Previously hardcoded to True, now properly uses the endpoint parameter. This allows faster responses when comments are not needed.	2026-03-15 10:00:20 -03:00
Marvin	f961b71992	Reddit Scraper with Selenium browser automation - Switched from API scraping to Selenium + Firefox headless browser - Uses old.reddit.com for cleaner DOM structure and better reliability - FastAPI server with CLI port selection (--port flag) - Custom error format: {"Error": "The boat went on fire..."} - Updated README with current implementation details	2026-03-14 17:57:08 -03:00
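
Commit da13778063 describes a RequestCache class with TTL-based expiration (5 minute default) that is skipped when comments are requested. A minimal sketch of that idea follows; it is not the repository's code. Only the class name, the 5-minute default, and the include_comments rule come from the commit message. The get/set methods, the injectable clock, the fetch_with_cache helper, and the use of the URL as cache key are assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class RequestCache:
    """Hypothetical TTL cache; method names and the clock parameter are assumptions."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._entries[key]
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


def fetch_with_cache(cache: RequestCache, url: str, include_comments: bool,
                     scrape: Callable[[str, bool], Any]) -> Any:
    """Hypothetical helper: bypass the cache whenever comments are requested."""
    if include_comments:
        return scrape(url, True)
    cached = cache.get(url)
    if cached is not None:
        return cached
    result = scrape(url, False)
    cache.set(url, result)
    return result
```

Passing the clock in as a parameter is a design choice of this sketch: it lets expiry be exercised with a fake time source instead of waiting five minutes.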

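Commit 278ed10adf describes wrapping browser initialization in a threading.RLock so that concurrent requests do not set up the browser twice. The sketch below shows the pattern with a stand-in browser object rather than Selenium, so it runs anywhere. RedditScraper and _ensure_browser are named in the commit message; browser_factory and fetch are hypothetical names introduced here.

```python
import threading
from typing import Any, Callable, Optional


class RedditScraper:
    """Illustrative only: the real class drives Selenium; this one takes a factory."""

    def __init__(self, browser_factory: Callable[[], Any]) -> None:
        self._browser_factory = browser_factory  # hypothetical injection point
        self._browser: Optional[Any] = None
        self._lock = threading.RLock()  # reentrant: the owning thread may re-acquire it

    def _ensure_browser(self) -> Any:
        # Only one thread at a time may check and create the browser.
        with self._lock:
            if self._browser is None:
                self._browser = self._browser_factory()
            return self._browser

    def fetch(self, url: str) -> Any:
        # Holds the lock for the whole request, then calls _ensure_browser(),
        # which takes the same lock again. A plain Lock would deadlock here.
        with self._lock:
            browser = self._ensure_browser()
            return browser(url)
```

The reentrancy is the reason for RLock over Lock in this sketch: a method that already holds the lock can call another locked method on the same thread without blocking itself.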
7 Commits

All Branches