PDF Text Extraction Daemon

FastAPI-based service for extracting text from PDF files hosted on the internet.

Quick Start

Install dependencies

pip install -r requirements.txt

Run the daemon

# Using uvicorn (default port 8000)
uvicorn pdf_daemon:app --host 0.0.0.0 --port 8000

# Or run directly with a custom port
python3 pdf_daemon.py --port 5006

API Endpoints

GET /health

Check if the service is running.

Response:

{
    "status": "healthy",
    "service": "PDF Text Extraction Daemon"
}
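
When the daemon is started from a script, it can help to wait for /health before sending requests. The following is a minimal sketch using only the standard library; the helper names is_healthy and wait_until_healthy, and the retry settings, are assumptions for illustration and not part of this project.

```python
import json
import time
import urllib.error
import urllib.request


def is_healthy(base_url, timeout=2.0):
    """Return True if GET /health answers with status "healthy"."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def wait_until_healthy(base_url, attempts=10, delay=0.5, probe=is_healthy):
    """Call probe up to `attempts` times, sleeping `delay` seconds after each failure."""
    for _ in range(attempts):
        if probe(base_url):
            return True
        time.sleep(delay)
    return False
```

For example, wait_until_healthy("http://localhost:8000") returns True once the daemon responds, or False after all attempts fail.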

GET /

API information and available endpoints.

GET /extract

Extract text from a PDF hosted at the given URL.

Query Parameters:

  • url (required): Direct link to PDF file
  • output_file (optional): Custom output filename

Example:

GET /extract?url=https://example.com/document.pdf&output_file=custom_output.txt

Response:

{
    "success": true,
    "text": "Extracted text content...",
    "file_size_kb": 423.82,
    "pages": 5,
    "extraction_time_ms": 90.42,
    "message": "Successfully extracted 5 page(s)"
}
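
If the PDF link itself contains characters such as ? or &, it should be percent-encoded before it is passed as the url query parameter, otherwise parts of it may be read as separate parameters. A small sketch of building the request URL with the standard library follows; build_extract_url is a hypothetical helper name, not something this project provides.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, pdf_url, output_file=None):
    """Return a GET /extract URL with percent-encoded query parameters."""
    params = {"url": pdf_url}
    if output_file is not None:
        params["output_file"] = output_file
    return f"{base_url.rstrip('/')}/extract?{urlencode(params)}"


# The PDF link here carries its own query string, so it must be encoded.
print(build_extract_url(
    "http://localhost:8000",
    "https://example.com/report.pdf?id=7&lang=en",
    output_file="custom_output.txt",
))
```

HTTP clients that accept a params mapping, such as requests, do this encoding automatically.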

Usage Examples

cURL examples:

Extract text from PDF URL:

curl "http://localhost:8000/extract?url=https://www.pdf995.com/samples/pdf.pdf"

Extract and save to custom file:

curl "http://localhost:8000/extract?url=https://www.pdf995.com/samples/pdf.pdf&output_file=my_output.txt"

Check health:

curl http://localhost:8000/health

Python example:

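import requests

# Ask the daemon to download the PDF and extract its text.
# requests percent-encodes the query parameters automatically.
response = requests.get(
    "http://localhost:8000/extract",
    params={
        "url": "https://example.com/document.pdf"
    },
    timeout=90,  # client-side timeout, longer than the daemon's 60-second download timeout
)
response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body

data = response.json()
print(f"Extracted {data['pages']} pages in {data['extraction_time_ms']:.2f}ms")
print(data['text'])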

Performance

Extraction uses PyMuPDF (fitz) and is fast. Approximate timings:

| File Size | Pages | Time    |
|-----------|-------|---------|
| 423 KB    | 5     | ~90 ms  |
| 1.2 MB    | ~80   | ~260 ms |

Error Handling

The API returns appropriate HTTP status codes:

  • 400 - Invalid URL or request format
  • 404 - PDF not found at URL
  • 500 - Server error (download or extraction failed)

Error Response:

{
    "detail": "Failed to download PDF: 404"
}
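
On the client side, the status code and the detail field can be combined into a single readable message. The sketch below assumes error bodies are JSON objects with a detail key, as in the example above; describe_error and STATUS_MEANINGS are hypothetical names introduced only for illustration.

```python
import json

# Status codes and meanings as listed in this README.
STATUS_MEANINGS = {
    400: "Invalid URL or request format",
    404: "PDF not found at URL",
    500: "Server error (download or extraction failed)",
}


def describe_error(status_code, body):
    """Build a readable message from an error status and the raw response body."""
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        payload = None  # body was missing or not valid JSON
    detail = payload.get("detail") if isinstance(payload, dict) else None
    meaning = STATUS_MEANINGS.get(status_code, "Unexpected status")
    if detail:
        return f"{status_code} {meaning}: {detail}"
    return f"{status_code} {meaning}"
```

With requests, this could be called as describe_error(response.status_code, response.text) whenever the status code is 400 or higher.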

Notes

CLI Tool (pdf_extractor.py)

  • Saves output to the same directory as the source PDF by default (with a .txt extension)
  • Use the --output flag for a custom output path

API Daemon (pdf_daemon.py)

  • Extracted text is always returned in the JSON response
  • Optional output_file parameter saves a copy to /tmp/ on the server
  • Maximum download timeout: 60 seconds
  • Supports both HTTP and HTTPS URLs