# PDF Text Extraction Daemon

FastAPI-based service for extracting text from PDF files hosted on the internet.

## Quick Start

### Install dependencies
```bash
pip install -r requirements.txt
```

### Run the daemon
```bash
# Using uvicorn (default port 8000)
uvicorn pdf_daemon:app --host 0.0.0.0 --port 8000

# Or run directly with custom port
python3 pdf_daemon.py --port 5006
```

## API Endpoints

### GET /health
Check if the service is running.

**Response:**
```json
{
    "status": "healthy",
    "service": "PDF Text Extraction Daemon"
}
```

### GET /
API information and available endpoints.

### GET /extract
Extract text from the PDF at the given URL.

**Query Parameters:**
- `url` (required): Direct link to PDF file
- `output_file` (optional): Filename for a copy of the extracted text saved under `/tmp/` on the server

**Example:**
```
GET /extract?url=https://example.com/document.pdf&output_file=custom_output.txt
```
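
If the PDF URL carries its own query string (for example `?id=1&v=2`), it needs to be percent-encoded before it is passed as the `url` parameter. Otherwise its `&` would be read as the start of a new parameter for `/extract`. A minimal sketch using only the standard library follows. The helper name `build_extract_url` and the sample PDF URL are illustrative assumptions, not part of the daemon.

```python
from urllib.parse import urlencode


def build_extract_url(base, pdf_url, output_file=None):
    """Return a percent-encoded /extract request URL (hypothetical helper)."""
    params = {"url": pdf_url}
    if output_file is not None:
        params["output_file"] = output_file
    # urlencode escapes ':', '/', '?', '=' and '&' inside each value
    return f"{base.rstrip('/')}/extract?{urlencode(params)}"


print(build_extract_url("http://localhost:8000", "https://example.com/doc.pdf?id=1&v=2"))
# http://localhost:8000/extract?url=https%3A%2F%2Fexample.com%2Fdoc.pdf%3Fid%3D1%26v%3D2
```

Clients such as `requests` do this encoding automatically when the values are passed through `params=`.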

**Response:**
```json
{
    "success": true,
    "text": "Extracted text content...",
    "file_size_kb": 423.82,
    "pages": 5,
    "extraction_time_ms": 90.42,
    "message": "Successfully extracted 5 page(s)"
}
```

## Usage Examples

### cURL examples:

**Extract text from PDF URL:**
```bash
curl "http://localhost:8000/extract?url=https://www.pdf995.com/samples/pdf.pdf"
```

**Extract and save a copy to a custom file on the server:**
```bash
curl "http://localhost:8000/extract?url=https://www.pdf995.com/samples/pdf.pdf&output_file=my_output.txt"
```

**Check health:**
```bash
curl http://localhost:8000/health
```

### Python example:

```python
import requests

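# Ask the daemon to download the PDF at the given URL and extract its text.
response = requests.get(
    "http://localhost:8000/extract",
    params={"url": "https://example.com/document.pdf"},
    timeout=90,  # client-side limit, above the daemon's 60-second download timeout
)
response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body

data = response.json()
print(f"Extracted {data['pages']} pages in {data['extraction_time_ms']:.2f}ms")
print(data["text"])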
```

## Performance

Extraction uses PyMuPDF (`fitz`) and completes in a fraction of a second for the sample files below:

| File Size | Pages | Extraction Time |
|-----------|-------|-----------------|
| 423 KB    | 5     | ~90 ms          |
| 1.2 MB    | ~80   | ~260 ms         |

## Error Handling

The API returns appropriate HTTP status codes:

- `400` - Invalid URL or request format
- `404` - PDF not found at URL
- `500` - Server error (download or extraction failed)

**Error Response:**
```json
{
    "detail": "Failed to download PDF: 404"
}
```
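
On the client side, the status codes and the `detail` field can be folded into a single message. The sketch below assumes that error bodies always carry a `detail` key and that successful bodies carry `success` and `pages`, as in the examples in this README. The function name `summarize_response` is hypothetical.

```python
def summarize_response(status_code, payload):
    """Turn a status code and parsed JSON body into a short message."""
    if status_code == 200 and payload.get("success"):
        return f"ok: {payload.get('pages')} page(s)"

    # Hints mirror the status codes documented in this README.
    hints = {
        400: "invalid URL or request format",
        404: "PDF not found at URL",
        500: "download or extraction failed",
    }
    hint = hints.get(status_code, "unexpected status")
    detail = payload.get("detail", "no detail provided")
    return f"error {status_code} ({hint}): {detail}"


print(summarize_response(404, {"detail": "Failed to download PDF: 404"}))
# error 404 (PDF not found at URL): Failed to download PDF: 404
```

With `requests`, this would be called as `summarize_response(response.status_code, response.json())`.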

## Notes

### CLI Tool (`pdf_extractor.py`)
- Saves output to the same directory as the source PDF by default (with a `.txt` extension)
- Use the `--output` flag to set a custom output path

### API Daemon (`pdf_daemon.py`)
- Extracted text is always returned in the JSON response
- The optional `output_file` parameter also saves a copy under `/tmp/` on the server
- Download timeout: 60 seconds
- Supports both HTTP and HTTPS URLs