# Reddit Super Duper Scraper 🔍

A powerful tool to scrape public Reddit data using **Selenium browser automation** - no authentication required. Accessible via the local network only.

## Features

- **No Authentication Required**: Uses Selenium + Firefox to scrape old.reddit.com directly
- **Browser-Based Scraping**: Avoids API rate limits by simulating real browser behavior
- **Flexible Scraping**: Extract posts from subreddits with optional nested comments
- **Configurable Depth**: Control comment nesting level (1-10 levels)
- **Local Network Only**: Runs on your local network, no security overhead
- **FastAPI Backend**: Automatic documentation at `/docs`

## Installation

1. Install Python dependencies:

```bash
pip install -r requirements.txt
```

2. Ensure geckodriver is installed (it should be auto-detected):

```bash
# Check if available
which geckodriver

# Expected path on most systems: /snap/bin/geckodriver or /usr/local/bin/geckodriver
```

3. Run the scraper:

```bash
# Using the main script with a custom port
python reddit-scraper.py --port 6969

# Or directly
python main.py --port 8000
```

## Usage

### API Endpoints

#### 1. Scrape Subreddit Top Posts (Recommended)

```bash
GET /scrape/subreddit/{subreddit}?limit=10&time_range=week&depth=1&include_comments=true
```

**Parameters:**

- `limit`: Number of posts to retrieve (1-100, default: 10)
- `time_range`: Time filter (`hour`, `day`, `week`, `month`, `year`, `all`)
- `depth`: Comment nesting depth (1-10, default: **1**)
- `include_comments`: Enable comment scraping (`true`/`false`, default: `true`)

**Example - Posts Only (Fastest):**

```bash
curl "http://localhost:8000/scrape/subreddit/python?limit=5&include_comments=false"
```

**Response:**

```json
{
  "subreddit": "python",
  "time_range": "week",
  "limit": 5,
  "posts_count": 5,
  "data": [
    {
      "title": "Python tips and tricks...",
      "author": "pythonista",
      "score": 1234,
      "created_utc": 1709827200,
      "url": "https://old.reddit.com/r/python/comments/...",
      "permalink": "/r/python/comments/..."
    }
  ]
}
```
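
Since the response is plain JSON, it is easy to post-process on the client side. A minimal sketch (the `top_titles` helper and the sample data are illustrative, not part of the API):

```python
def top_titles(response: dict, n: int = 3) -> list[str]:
    """Return the titles of the n highest-scoring posts in a response."""
    posts = sorted(response["data"], key=lambda p: p["score"], reverse=True)
    return [p["title"] for p in posts[:n]]

# Sample data shaped like the response above (illustrative values)
response = {
    "subreddit": "python",
    "posts_count": 2,
    "data": [
        {"title": "Tips and tricks", "score": 1234},
        {"title": "Release notes", "score": 98},
    ],
}
print(top_titles(response, n=1))  # ['Tips and tricks']
```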

**Example - With Comments:**

```bash
curl "http://localhost:8000/scrape/subreddit/python?limit=3&include_comments=true"
```

#### 2. Scrape Specific Post (Browser-Based)

```bash
GET /scrape/post/{post_id}?depth=3
```

**Example:**

```bash
curl "http://localhost:8000/scrape/post/1rt20n2"
```

#### 3. Custom Scraping (Flexible)

```bash
POST /scrape/custom
Content-Type: application/json

{
  "type": "subreddit",
  "target": "programming",
  "limit": 10,
  "time_range": "week",
  "depth": 3,
  "include_comments": true
}
```
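
From Python, the same request can be sent with the standard library alone. A minimal sketch assuming the server runs on `localhost:8000` (adjust the base URL to match your `--port`):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # adjust to match your --port

def build_custom_request(payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request for /scrape/custom."""
    return urllib.request.Request(
        f"{base_url}/scrape/custom",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_custom_request({
    "type": "subreddit",
    "target": "programming",
    "limit": 10,
    "time_range": "week",
    "depth": 3,
    "include_comments": True,
})
# Send it once the server is up:
# with urllib.request.urlopen(req) as resp:
#     data = json.load(resp)
```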
#### 4. Health Check

```bash
GET /health
```

**Response:**

```json
{
  "status": "healthy",
  "message": "The ship is sailing smoothly"
}
```
### API Documentation

Once running, visit:

- **Swagger UI**: http://localhost:{port}/docs
- **ReDoc**: http://localhost:{port}/redoc

## CLI Options

```bash
python reddit-scraper.py --help

Usage: reddit-scraper.py [-h] [--port PORT]

Options:
  -h, --help   show this help message and exit
  --port PORT  Port to run the server on (default: 8000)
```

## Output Format

### Post Structure

```json
{
  "title": "...",
  "author": "...",
  "score": ...,
  "created_utc": ...,
  "url": "...",
  "permalink": "/r/{sub}/comments/...",
  "comments": [...] // Empty if include_comments=false or no comments available
}
```
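
`created_utc` is a Unix timestamp in seconds, so it converts directly to a timezone-aware datetime. For example, in Python:

```python
from datetime import datetime, timezone

# Convert a created_utc value to an aware UTC datetime
created = datetime.fromtimestamp(1709827200, tz=timezone.utc)
print(created.isoformat())  # 2024-03-07T16:00:00+00:00
```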
### Comment Structure (Nested)

```json
{
  "author": "...",
  "body": "...",
  "score": ...,
  "created_utc": ...,
  "replies": [
    {
      "author": "...",
      "body": "...",
      "score": ...,
      "replies": [] // Can nest up to depth levels
    }
  ]
}
```
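
Because replies nest recursively, a short recursive walk handles threads of any depth. A sketch (the helper names are illustrative):

```python
def count_comments(comment: dict) -> int:
    """Count a comment plus all of its nested replies."""
    return 1 + sum(count_comments(r) for r in comment.get("replies", []))

def thread_depth(comment: dict) -> int:
    """Return how many levels deep a comment thread goes."""
    replies = comment.get("replies", [])
    return 1 + (max(thread_depth(r) for r in replies) if replies else 0)

thread = {
    "author": "alice", "body": "parent", "score": 10,
    "replies": [
        {"author": "bob", "body": "reply", "score": 3, "replies": []},
        {"author": "carol", "body": "reply", "score": 1,
         "replies": [{"author": "dan", "body": "nested", "score": 2, "replies": []}]},
    ],
}
print(count_comments(thread))  # 4
print(thread_depth(thread))    # 3
```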
## Error Handling

All errors return a consistent format:

```json
{
  "Error": "The boat went on fire (message)"
}
```

Common error scenarios:

- **Browser Not Available**: geckodriver or Firefox not installed
- **Invalid Subreddit**: Post not found or subreddit doesn't exist
- **Network Error**: Connection issues during scraping
- **Rate Limited**: Reddit temporarily blocked the browser session
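
Client code can detect a failure by checking for the `Error` key before touching the payload. A sketch (the `unwrap` helper is illustrative, not part of the API):

```python
def unwrap(payload: dict) -> dict:
    """Raise if the API returned its error format, otherwise pass through."""
    if "Error" in payload:
        raise RuntimeError(payload["Error"])
    return payload

ok = unwrap({"status": "healthy"})  # passes through unchanged
try:
    unwrap({"Error": "The boat went on fire (timeout)"})
except RuntimeError as exc:
    print(exc)  # The boat went on fire (timeout)
```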
## How It Works

This scraper uses **Selenium with headless Firefox** to scrape old.reddit.com directly:

1. Launches a headless Firefox browser via geckodriver
2. Navigates to `https://old.reddit.com/r/{subreddit}/top/`
3. Uses JavaScript evaluation to extract post/comment data from the DOM
4. Returns structured JSON response
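
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the `a.title` selector is an assumption about old.reddit.com's markup, and the Selenium import is deferred so the URL helper works even without Selenium installed.

```python
def top_url(subreddit: str) -> str:
    """Build the old.reddit.com top-posts URL for a subreddit."""
    return f"https://old.reddit.com/r/{subreddit}/top/"

def fetch_titles(subreddit: str, limit: int = 10) -> list[str]:
    # Deferred import: requires selenium, Firefox, and geckodriver.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options

    opts = Options()
    opts.add_argument("-headless")  # run without a visible window
    driver = webdriver.Firefox(options=opts)
    try:
        driver.get(top_url(subreddit))
        # "a.title" is old.reddit.com's post-title link class (assumed)
        links = driver.find_elements(By.CSS_SELECTOR, "a.title")
        return [a.text for a in links[:limit]]
    finally:
        driver.quit()  # always release the browser
```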
### Why old.reddit.com?

- Cleaner, more stable DOM structure than the main Reddit site
- Less aggressive bot detection
- Simpler HTML that's easier to parse reliably

## Limitations

### Comment Extraction Notes

Reddit's UI includes navigation elements (like "more comments" expanders) that may appear in scraped data. The scraper attempts to filter these, but some edge cases may occur with deeply nested or collapsed comment threads.

### Browser Dependencies

- Requires the Firefox browser to be installed
- Requires geckodriver for Selenium automation
- The two must be compatible versions

## Configuration

Edit `config.py` to customize:

- Reddit URL endpoints (uses old.reddit.com by default)
- Default limits and depths
- Rate limiting delays
- Maximum depth constraints
- Server defaults (host, port)

```python
# config.py examples
REDDIT_SUBREDDIT_TOP_URL = "https://old.reddit.com/r/{}/top.json"
DEFAULT_DEPTH = 1
MAX_DEPTH = 10
DEFAULT_PORT = 8000
```

## Security Note

This service is designed for local network use only. No authentication tokens are required on the API itself, so ensure the server is not exposed to public networks without additional security measures.

## License

MIT License - Feel free to modify and use as needed!