# PDF Text Extraction Daemon

FastAPI-based service for extracting text from PDF files hosted on the internet.

## Quick Start

### Install dependencies

```bash
pip install -r requirements.txt
```

### Run the daemon

```bash
# Using uvicorn (default port 8000)
uvicorn pdf_daemon:app --host 0.0.0.0 --port 8000

# Or run directly with custom port
python3 pdf_daemon.py --port 5006
```

## API Endpoints

### GET /health

Check if the service is running.

**Response:**

```json
{
  "status": "healthy",
  "service": "PDF Text Extraction Daemon"
}
```

### GET /

API information and available endpoints.

### GET /extract

Extract text from a PDF hosted at a URL.

**Query Parameters:**

- `url` (required): Direct link to the PDF file
- `output_file` (optional): Custom output filename

**Example:**

```
GET /extract?url=https://example.com/document.pdf&output_file=custom_output.txt
```

**Response:**

```json
{
  "success": true,
  "text": "Extracted text content...",
  "file_size_kb": 423.82,
  "pages": 5,
  "extraction_time_ms": 90.42,
  "message": "Successfully extracted 5 page(s)"
}
```

## Usage Examples

### cURL examples:

**Extract text from PDF URL:**

```bash
curl "http://localhost:8000/extract?url=https://www.pdf995.com/samples/pdf.pdf"
```

**Extract and save to custom file:**

```bash
curl "http://localhost:8000/extract?url=https://www.pdf995.com/samples/pdf.pdf&output_file=my_output.txt"
```

**Check health:**

```bash
curl http://localhost:8000/health
```

### Python example:

```python
import requests

response = requests.get(
    "http://localhost:8000/extract",
    params={
        "url": "https://example.com/document.pdf"
    }
)
# Error responses carry "detail" instead of the fields read below,
# so stop here on any 4xx/5xx status.
response.raise_for_status()

data = response.json()
print(f"Extracted {data['pages']} pages in {data['extraction_time_ms']:.2f}ms")
print(data['text'])
```

## Performance

Extraction uses PyMuPDF (fitz). Measured times for two sample files:

| File Size | Pages | Time    |
|-----------|-------|---------|
| 423 KB    | 5     | ~90 ms  |
| 1.2 MB    | ~80   | ~260 ms |

## Error Handling

The API returns the following HTTP status codes on failure:

- `400` - Invalid URL or request format
- `404` - PDF not found at URL
- `500` - Server error (download or extraction failed)

**Error Response:**

```json
{
  "detail": "Failed to download PDF: 404"
}
```

## Notes

### CLI Tool (`pdf_extractor.py`)

- Saves output to the same directory as the source PDF by default (with a `.txt` extension)
- Use the `--output` flag for a custom output path

### API Daemon (`pdf_daemon.py`)

- Extracted text is always returned in the JSON response
- The optional `output_file` parameter saves a copy to `/tmp/` on the server
- Maximum download timeout: 60 seconds
- Supports both HTTP and HTTPS URLs
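
### Example: client-side error handling

The status codes and the `detail` field described under Error Handling can be handled on the client roughly as follows. This is a minimal sketch, not part of the daemon: `ExtractionError`, `build_extract_url`, `parse_error_detail`, and `extract_text` are hypothetical helper names, the base URL assumes the default port 8000, and every error body is assumed to follow the `{"detail": "..."}` shape shown in the error response example. It uses only the standard library so it runs without extra packages.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Assumption: the daemon was started on the default port.
BASE_URL = "http://localhost:8000"


class ExtractionError(Exception):
    """Hypothetical exception for non-2xx answers from the daemon."""

    def __init__(self, status, detail):
        super().__init__(f"HTTP {status}: {detail}")
        self.status = status
        self.detail = detail


def build_extract_url(pdf_url, output_file=None, base_url=BASE_URL):
    """Build the /extract URL, percent-encoding both query parameters."""
    params = {"url": pdf_url}
    if output_file is not None:
        params["output_file"] = output_file
    return f"{base_url}/extract?{urllib.parse.urlencode(params)}"


def parse_error_detail(body):
    """Return the "detail" message from an error body, or the raw text."""
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        return str(body)
    if isinstance(payload, dict) and "detail" in payload:
        return str(payload["detail"])
    return str(body)


def extract_text(pdf_url, output_file=None, timeout=90):
    """Call GET /extract and return the decoded JSON response as a dict."""
    request_url = build_extract_url(pdf_url, output_file)
    try:
        with urllib.request.urlopen(request_url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Assumed: 400, 404 and 500 responses all carry a JSON "detail" field.
        body = err.read().decode("utf-8", errors="replace")
        raise ExtractionError(err.code, parse_error_detail(body)) from err
```

A caller would wrap `extract_text("https://example.com/document.pdf")` in `try`/`except ExtractionError` and branch on `err.status`. Percent-encoding the `url` parameter matters when the PDF link itself contains `?` or `&`, which would otherwise be read as part of the daemon's own query string. The 90-second client timeout is an arbitrary choice that simply exceeds the daemon's 60-second download limit.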