# PDF Text Extraction Daemon
A FastAPI-based service that extracts text from PDF files hosted on the internet.
## Quick Start
### Install dependencies
```bash
pip install -r requirements.txt
```
### Run the daemon
```bash
# Using uvicorn (default port 8000)
uvicorn pdf_daemon:app --host 0.0.0.0 --port 8000
# Or run directly with custom port
python3 pdf_daemon.py --port 5006
```
## API Endpoints
### GET /health
Check if the service is running.
**Response:**
```json
{
  "status": "healthy",
  "service": "PDF Text Extraction Daemon"
}
```
### GET /
API information and available endpoints.
### GET /extract
Extract text from a PDF hosted at the given URL.
**Query Parameters:**
- `url` (required): Direct link to PDF file
- `output_file` (optional): Filename for a copy of the extracted text saved under `/tmp/` on the server
**Example:**
```
GET /extract?url=https://example.com/document.pdf&output_file=custom_output.txt
```
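The request above passes one URL inside another. If the PDF link has its own query string, its `&` and `=` characters would be read as extra parameters of `/extract`, so the value needs percent-encoding. A minimal sketch using only the standard library; `build_extract_url` is a hypothetical helper name and is not part of this project:

```python
from typing import Optional
from urllib.parse import urlencode


def build_extract_url(base: str, pdf_url: str, output_file: Optional[str] = None) -> str:
    """Return an /extract request URL with percent-encoded query parameters."""
    params = {"url": pdf_url}
    if output_file is not None:
        params["output_file"] = output_file
    # urlencode percent-encodes reserved characters such as ':', '/', '?', '&' and '='.
    return f"{base.rstrip('/')}/extract?{urlencode(params)}"


print(build_extract_url("http://localhost:8000", "https://example.com/a.pdf?id=1&v=2"))
```

Clients such as `requests` do this encoding automatically when the values are passed through `params=`.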
**Response:**
```json
{
  "success": true,
  "text": "Extracted text content...",
  "file_size_kb": 423.82,
  "pages": 5,
  "extraction_time_ms": 90.42,
  "message": "Successfully extracted 5 page(s)"
}
```
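On the client side, the response fields can be loaded into a typed structure so that a missing field fails early. A minimal sketch, assuming the response always carries the six fields shown above; `ExtractionResult` and `parse_extraction` are hypothetical names and are not part of this project:

```python
from dataclasses import dataclass


@dataclass
class ExtractionResult:
    success: bool
    text: str
    file_size_kb: float
    pages: int
    extraction_time_ms: float
    message: str


def parse_extraction(payload: dict) -> ExtractionResult:
    """Build an ExtractionResult, raising ValueError if a documented field is missing."""
    fields = ("success", "text", "file_size_kb", "pages", "extraction_time_ms", "message")
    missing = [name for name in fields if name not in payload]
    if missing:
        raise ValueError(f"response is missing fields: {', '.join(missing)}")
    return ExtractionResult(**{name: payload[name] for name in fields})


# The sample response documented above.
sample = {
    "success": True,
    "text": "Extracted text content...",
    "file_size_kb": 423.82,
    "pages": 5,
    "extraction_time_ms": 90.42,
    "message": "Successfully extracted 5 page(s)",
}
result = parse_extraction(sample)
print(f"{result.pages} pages in {result.extraction_time_ms:.2f} ms")
```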
## Usage Examples
### cURL examples:
**Extract text from PDF URL:**
```bash
curl "http://localhost:8000/extract?url=https://www.pdf995.com/samples/pdf.pdf"
```
**Extract and save to custom file:**
```bash
curl "http://localhost:8000/extract?url=https://www.pdf995.com/samples/pdf.pdf&output_file=my_output.txt"
```
**Check health:**
```bash
curl http://localhost:8000/health
```
### Python example:
```python
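import requests

# The daemon allows up to 60 seconds for the download, so give the client longer.
response = requests.get(
    "http://localhost:8000/extract",
    params={"url": "https://example.com/document.pdf"},
    timeout=90,
)
response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body

data = response.json()
print(f"Extracted {data['pages']} pages in {data['extraction_time_ms']:.2f}ms")
print(data['text'])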
```
## Performance
Extraction uses PyMuPDF (`fitz`). Sample timings:
| File Size | Pages | Time |
|-----------|-------|------|
| 423 KB | 5 | ~90ms |
| 1.2 MB | ~80 | ~260ms |
## Error Handling
The API returns appropriate HTTP status codes:
- `400` - Invalid URL or request format
- `404` - PDF not found at URL
- `500` - Server error (download or extraction failed)
**Error Response:**
```json
{
  "detail": "Failed to download PDF: 404"
}
```
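A client can combine the status code with the `detail` field to produce a readable message. A minimal sketch, assuming error bodies always use the `detail` key shown above; `STATUS_MEANINGS` and `describe_error` are hypothetical names and are not part of this project:

```python
# Meanings of the status codes documented in this section.
STATUS_MEANINGS = {
    400: "invalid URL or request format",
    404: "PDF not found at URL",
    500: "server error (download or extraction failed)",
}


def describe_error(status_code: int, payload: dict) -> str:
    """Combine a documented status meaning with the server's detail message."""
    meaning = STATUS_MEANINGS.get(status_code, "undocumented status code")
    detail = payload.get("detail", "no detail provided")
    return f"HTTP {status_code}, {meaning}: {detail}"


print(describe_error(500, {"detail": "Failed to download PDF: 404"}))
```

With `requests`, this could be called as `describe_error(response.status_code, response.json())` whenever `response.ok` is false.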
## Notes
### CLI Tool (`pdf_extractor.py`)
- Saves output to the same directory as the source PDF by default (with a `.txt` extension)
- Use the `--output` flag for a custom output path
### API Daemon (`pdf_daemon.py`)
- Extracted text is always returned in the JSON response
- Optional `output_file` parameter saves a copy to `/tmp/` on the server
- Maximum download timeout: 60 seconds
- Supports both HTTP and HTTPS URLs
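Because `output_file` comes from the request, a server that writes it under `/tmp/` would normally strip any directory components first. How `pdf_daemon.py` handles this is not documented here, so the following is only an illustrative sketch; `OUTPUT_DIR` and `resolve_output_path` are hypothetical names:

```python
from pathlib import PurePosixPath

# Assumption: copies are written directly under /tmp/, as stated in the notes above.
OUTPUT_DIR = PurePosixPath("/tmp")


def resolve_output_path(output_file: str) -> PurePosixPath:
    """Keep only the final path component so the result stays inside OUTPUT_DIR."""
    name = PurePosixPath(output_file).name
    if not name or name == "..":
        raise ValueError(f"not a usable filename: {output_file!r}")
    return OUTPUT_DIR / name


print(resolve_output_path("custom_output.txt"))
print(resolve_output_path("../../etc/passwd"))
```

Taking only the final component means a value such as `../../etc/passwd` cannot escape the output directory.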