# PDF Text Extraction Daemon

FastAPI-based service for extracting text from PDF files hosted on the internet.

## Quick Start

### Install dependencies
```bash
pip install -r requirements.txt
```

### Run the daemon
```bash
# Using uvicorn (default port 8000)
uvicorn pdf_daemon:app --host 0.0.0.0 --port 8000

# Or run directly with a custom port
python3 pdf_daemon.py --port 5006
```

## API Endpoints

### GET /health
Check if the service is running.

**Response:**
```json
{
  "status": "healthy",
  "service": "PDF Text Extraction Daemon"
}
```

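For scripts that start the daemon and then call it right away, a readiness check against `/health` can be sketched as follows. The helper names `fetch_health` and `wait_until_healthy`, the injectable `fetch` callable, and the retry defaults are assumptions for illustration and are not part of the daemon; only the `"status": "healthy"` field comes from the response above.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str = "http://localhost:8000") -> dict:
    """Fetch and decode the /health response (raises OSError if unreachable)."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_healthy(
    fetch: Callable[[], dict] = fetch_health,
    attempts: int = 10,
    delay_s: float = 0.5,
) -> bool:
    """Poll until the service reports "healthy" or the attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch().get("status") == "healthy":
                return True
        except (OSError, ValueError):
            # Connection refused, timeout, or a non-JSON body: not ready yet.
            pass
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

Passing `fetch` as a parameter keeps the retry loop testable without a running server.
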
### GET /
Returns API information and the list of available endpoints.

### GET /extract
Extract text from a PDF hosted at the given URL.

**Query Parameters:**
- `url` (required): Direct link to the PDF file
- `output_file` (optional): Custom output filename; a copy of the extracted text is saved under `/tmp/` on the server

**Example:**
```
GET /extract?url=https://example.com/document.pdf&output_file=custom_output.txt
```

**Response:**
```json
{
  "success": true,
  "text": "Extracted text content...",
  "file_size_kb": 423.82,
  "pages": 5,
  "extraction_time_ms": 90.42,
  "message": "Successfully extracted 5 page(s)"
}
```

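Because the PDF address is itself passed as a query parameter, it should be percent-encoded whenever it contains characters such as `&` or `?`; otherwise part of it may be read as a separate parameter of `/extract`. A minimal sketch using the standard library, where the helper name `build_extract_url` and the default base URL are assumptions:

```python
from typing import Optional
from urllib.parse import urlencode


def build_extract_url(
    pdf_url: str,
    output_file: Optional[str] = None,
    base_url: str = "http://localhost:8000",
) -> str:
    """Return a /extract request URL with safely encoded query parameters."""
    params = {"url": pdf_url}
    if output_file is not None:
        params["output_file"] = output_file
    # urlencode percent-encodes reserved characters (":", "/", "?", "&", "=")
    # so the nested URL survives as a single parameter value.
    return f"{base_url}/extract?{urlencode(params)}"
```

HTTP clients that accept a parameter mapping, such as `requests.get(..., params=...)`, perform the same encoding automatically.
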
## Usage Examples

### cURL examples:

**Extract text from PDF URL:**
```bash
curl "http://localhost:8000/extract?url=https://www.pdf995.com/samples/pdf.pdf"
```

**Extract and save to custom file:**
```bash
curl "http://localhost:8000/extract?url=https://www.pdf995.com/samples/pdf.pdf&output_file=my_output.txt"
```

**Check health:**
```bash
curl http://localhost:8000/health
```

### Python example:

```python
import requests

response = requests.get(
    "http://localhost:8000/extract",
    params={
        "url": "https://example.com/document.pdf"
    },
    timeout=90,  # client-side limit, above the server's 60-second download timeout
)
response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body

data = response.json()
print(f"Extracted {data['pages']} pages in {data['extraction_time_ms']:.2f}ms")
print(data['text'])
```

## Performance

Extraction uses PyMuPDF (fitz). Sample timings:

| File Size | Pages | Time |
|-----------|-------|------|
| 423 KB | 5 | ~90ms |
| 1.2 MB | ~80 | ~260ms |

## Error Handling

The API returns the following HTTP status codes on failure:

- `400` - Invalid URL or request format
- `404` - PDF not found at URL
- `500` - Server error (download or extraction failed)

**Error Response:**
```json
{
  "detail": "Failed to download PDF: 404"
}
```

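On the client side, these status codes and the `detail` field can be combined into a readable message. A minimal sketch: the function name `describe_error`, the `STATUS_MEANINGS` table, and the message wording are assumptions, and only the status codes and the `detail` key come from the description above.

```python
from typing import Optional

# Meanings taken from the status-code list above.
STATUS_MEANINGS = {
    400: "Invalid URL or request format",
    404: "PDF not found at URL",
    500: "Server error (download or extraction failed)",
}


def describe_error(status_code: int, body: Optional[dict] = None) -> str:
    """Build a one-line error message from a status code and parsed JSON body."""
    meaning = STATUS_MEANINGS.get(status_code, "Unexpected error")
    detail = (body or {}).get("detail")
    if detail:
        return f"HTTP {status_code}: {meaning} ({detail})"
    return f"HTTP {status_code}: {meaning}"
```

With `requests`, this could be called as `describe_error(response.status_code, response.json())` whenever `response.ok` is false.
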
## Notes

### CLI Tool (`pdf_extractor.py`)
- Saves output to the same directory as the source PDF by default (with a `.txt` extension)
- Use the `--output` flag for a custom output path

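The default output location described above amounts to swapping the PDF's extension for `.txt`. A minimal sketch with `pathlib`; the function name `default_output_path` is hypothetical, and the actual logic in `pdf_extractor.py` may differ:

```python
from pathlib import Path
from typing import Optional


def default_output_path(pdf_path: str, output: Optional[str] = None) -> Path:
    """Return the explicit output path if given, else <pdf name>.txt beside the PDF."""
    if output is not None:
        return Path(output)
    # with_suffix replaces only the final extension and keeps the directory.
    return Path(pdf_path).with_suffix(".txt")
```
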
### API Daemon (`pdf_daemon.py`)
- Extracted text is always returned in the JSON response
- The optional `output_file` parameter saves a copy to `/tmp/` on the server
- Maximum download timeout: 60 seconds
- Supports both HTTP and HTTPS URLs
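
Since only HTTP and HTTPS URLs are supported, a client can reject other inputs before sending a request. A minimal sketch; the helper name `is_supported_pdf_url` is an assumption, and the daemon's own validation may be stricter or looser:

```python
from urllib.parse import urlparse


def is_supported_pdf_url(url: str) -> bool:
    """Return True for absolute http:// or https:// URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Given the 60-second download limit on the server, a client-side timeout somewhat above 60 seconds avoids giving up before the daemon does.
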