Marvin f961b71992 Reddit Scraper with Selenium browser automation
- Switched from API scraping to Selenium + Firefox headless browser
- Uses old.reddit.com for cleaner DOM structure and better reliability
- FastAPI server with CLI port selection (--port flag)
- Custom error format: {"Error": "The boat went on fire..."}
- Updated README with current implementation details
2026-03-14 17:57:08 -03:00

README.md

Reddit Super Duper Scraper 🔍

A powerful tool to scrape public Reddit data using Selenium browser automation - no authentication required. Accessible via local network only.

Features

  • No Authentication Required: Uses Selenium + Firefox to scrape old.reddit.com directly
  • Browser-Based Scraping: Avoids API rate limits by simulating real browser behavior
  • Flexible Scraping: Extract posts from subreddits with optional nested comments
  • Configurable Depth: Control comment nesting level (1-10 levels)
  • Local Network Only: Runs on your local network, no security overhead
  • FastAPI Backend: Automatic documentation at /docs

Installation

  1. Install Python dependencies:
pip install -r requirements.txt
  2. Ensure geckodriver is installed (should be auto-detected):
# Check if available
which geckodriver
# Expected path on most systems: /snap/bin/geckodriver or /usr/local/bin/geckodriver
  3. Run the scraper:
# Using the main script with custom port
python reddit-scraper.py --port 6969

# Or directly
python main.py --port 8000
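If you want to confirm the driver setup from Python rather than the shell, a small hypothetical helper like the one below can do it. The function name and its injectable lookup are assumptions for illustration; the project itself may check differently.

```python
# Hypothetical helper: report whether the geckodriver binary is on PATH.
import shutil


def check_geckodriver(which=shutil.which):
    """Return a human-readable status string for the geckodriver binary.

    `which` is injectable so the check can be exercised without relying
    on the machine's actual PATH; it defaults to shutil.which.
    """
    found = which("geckodriver")
    if found:
        return f"geckodriver found at {found}"
    return "geckodriver not found; install it or add it to PATH"


if __name__ == "__main__":
    print(check_geckodriver())
```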

Usage

API Endpoints

1. Scrape Subreddit Posts

GET /scrape/subreddit/{subreddit}?limit=10&time_range=week&depth=1&include_comments=true

Parameters:

  • limit: Number of posts to retrieve (1-100, default: 10)
  • time_range: Time filter (hour, day, week, month, year, all)
  • depth: Comment nesting depth (1-10, default: 1)
  • include_comments: Enable comment scraping (true/false, default: true)

Example - Posts Only (Fastest):

curl "http://localhost:8000/scrape/subreddit/python?limit=5&include_comments=false"

Response:

{
  "subreddit": "python",
  "time_range": "week",
  "limit": 5,
  "posts_count": 5,
  "data": [
    {
      "title": "Python tips and tricks...",
      "author": "pythonista",
      "score": 1234,
      "created_utc": 1709827200,
      "url": "https://old.reddit.com/r/python/comments/...",
      "permalink": "/r/python/comments/..."
    }
  ]
}

Example - With Comments:

curl "http://localhost:8000/scrape/subreddit/python?limit=3&include_comments=true"
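The same requests can be issued from Python. The helper below is a sketch that assembles the endpoint URL from the documented query parameters; the base URL and function name are assumptions (a local server on port 8000).

```python
# Hypothetical helper: build the subreddit scrape URL from the
# documented query parameters (limit, time_range, depth, include_comments).
from urllib.parse import urlencode


def build_subreddit_url(subreddit, limit=10, time_range="week",
                        depth=1, include_comments=True,
                        base="http://localhost:8000"):
    params = {
        "limit": limit,
        "time_range": time_range,
        "depth": depth,
        # The API expects lowercase "true"/"false" in the query string.
        "include_comments": str(include_comments).lower(),
    }
    return f"{base}/scrape/subreddit/{subreddit}?{urlencode(params)}"
```

Pass the result to any HTTP client (urllib, requests, httpx) to fetch the JSON response shown above.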

2. Scrape Specific Post (Browser-Based)

GET /scrape/post/{post_id}?depth=3

Example:

curl "http://localhost:8000/scrape/post/1rt20n2"

3. Custom Scraping (Flexible)

POST /scrape/custom
Content-Type: application/json

{
  "type": "subreddit",
  "target": "programming",
  "limit": 10,
  "time_range": "week",
  "depth": 3,
  "include_comments": true
}

4. Health Check

GET /health

Response:

{
  "status": "healthy",
  "message": "The ship is sailing smoothly"
}

API Documentation

Once running, visit:

  • Swagger UI: http://localhost:{port}/docs
  • ReDoc: http://localhost:{port}/redoc

CLI Options

python reddit-scraper.py --help

Usage: reddit-scraper.py [-h] [--port PORT]

Options:
  -h, --help   show this help message and exit
  --port PORT  Port to run the server on (default: 8000)
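The help text above matches what argparse would produce for a single --port flag. A minimal sketch of such a parser is shown below, assuming argparse is used (the actual reddit-scraper.py may differ):

```python
# Minimal sketch of the CLI shown above, assuming argparse.
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="reddit-scraper.py")
    parser.add_argument("--port", type=int, default=8000,
                        help="Port to run the server on (default: 8000)")
    return parser
```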

Output Format

Post Structure

{
  "title": "...",
  "author": "...",
  "score": ...,
  "created_utc": ...,
  "url": "...",
  "permalink": "/r/{sub}/comments/...",
  "comments": [...]  // Empty if include_comments=false or no comments available
}

Comment Structure (Nested)

{
  "author": "...",
  "body": "...",
  "score": ...,
  "created_utc": ...,
  "replies": [
    {
      "author": "...",
      "body": "...",
      "score": ...,
      "replies": []  // Can nest up to depth levels
    }
  ]
}
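Because replies nest recursively up to the requested depth, consumers often want to flatten the tree. The hypothetical helper below walks the structure above and yields (depth, author, body) tuples:

```python
# Hypothetical helper: flatten the nested comment structure into
# (depth, author, body) tuples for easier processing.
def walk_comments(comments, depth=1):
    flat = []
    for comment in comments:
        flat.append((depth, comment.get("author"), comment.get("body")))
        # Recurse into replies, one level deeper each time.
        flat.extend(walk_comments(comment.get("replies", []), depth + 1))
    return flat
```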

Error Handling

All errors return a consistent format:

{
  "Error": "The boat went on fire (message)"
}
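A helper that produces this shape might look like the sketch below; the function name is an assumption, and in a FastAPI app its return value would typically be sent back via a JSONResponse in an exception handler.

```python
# Hypothetical helper: wrap any error message in the project's
# nautical error format shown above.
def boat_error(message):
    return {"Error": f"The boat went on fire ({message})"}
```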

Common error scenarios:

  • Browser Not Available: geckodriver or Firefox not installed
  • Invalid Subreddit: Post not found or subreddit doesn't exist
  • Network Error: Connection issues during scraping
  • Rate Limited: Reddit temporarily blocked the browser session

How It Works

This scraper uses Selenium with headless Firefox to scrape old.reddit.com directly:

  1. Launches a headless Firefox browser via geckodriver
  2. Navigates to https://old.reddit.com/r/{subreddit}/top/
  3. Uses JavaScript evaluation to extract post/comment data from the DOM
  4. Returns structured JSON response
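The real extraction in step 3 runs as JavaScript inside headless Firefox, which can't be reproduced here without a browser. As a rough illustration of the idea, this pure-Python sketch pulls post titles out of old.reddit-style markup, assuming each post title is rendered as an `<a class="title">` link (the exact class names on old.reddit.com may differ):

```python
# Illustrative sketch: extract post titles from old.reddit-style HTML
# using only the standard library. Not the project's actual extractor.
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if tag == "a" and "title" in classes.split():
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self.titles.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title = False


sample = '<div class="thing"><a class="title" href="/r/python/1">Hello</a></div>'
parser = TitleExtractor()
parser.feed(sample)
```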

Why old.reddit.com?

  • Cleaner, more stable DOM structure than the main Reddit site
  • Less aggressive bot detection
  • Simpler HTML that's easier to parse reliably

Limitations

Comment Extraction Notes

Reddit's UI includes navigation elements (like "more comments" expanders) that may appear in scraped data. The scraper attempts to filter these, but some edge cases may occur with deeply nested or collapsed comment threads.

Browser Dependencies

  • Requires Firefox browser installed
  • Requires geckodriver for Selenium automation
  • Both must be compatible versions

Configuration

Edit config.py to customize:

  • Reddit URL endpoints (uses old.reddit.com by default)
  • Default limits and depths
  • Rate limiting delays
  • Maximum depth constraints
  • Server defaults (host, port)
# config.py examples
REDDIT_SUBREDDIT_TOP_URL = "https://old.reddit.com/r/{}/top/"
DEFAULT_DEPTH = 1
MAX_DEPTH = 10
DEFAULT_PORT = 8000

Security Note

This service is designed for local network use only. No authentication tokens are required on the API itself, so ensure the server is not exposed to public networks without additional security measures.

License

MIT License - Feel free to modify and use as needed!