Nicolas Sanchez 392522402d 2026-03-16 12:03:22 -03:00
Initial release: Fast PDF text extraction CLI and API daemon

Features:
- CLI tool (pdf_extractor.py) for local files and URLs using PyMuPDF
- FastAPI daemon (pdf_daemon.py) with GET /extract endpoint
- Query parameter-based API for easier agent integration
- Comprehensive test suites included

Performance:
- ~40-60x faster than pdfplumber (~50ms average extraction time)
- Handles PDFs up to 36+ MB efficiently

Documentation:
- README.md with full API reference
- QUICKSTART.md for both CLI and daemon modes

.gitignore
QUICKSTART.md
README.md
TEST_RESULTS.md
comprehensive_test.sh
pdf_daemon.py
pdf_extractor.py
requirements.txt
session
test_daemon.sh

README.md

PDF Text Extraction Daemon

FastAPI-based service for extracting text from PDF files hosted on the internet.

Quick Start

Install dependencies

pip install -r requirements.txt

Run the daemon

# Using uvicorn (default port 8000)
uvicorn pdf_daemon:app --host 0.0.0.0 --port 8000

# Or run directly with custom port
python3 pdf_daemon.py --port 5006

API Endpoints

GET /health

Check if the service is running.

Response:

{
    "status": "healthy",
    "service": "PDF Text Extraction Daemon"
}
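
A deployment script or client can poll this endpoint until the daemon is ready. The sketch below is an illustration only and not part of this repository: `fetch_health`, `wait_until_healthy`, and the retry settings are hypothetical, and the base URL assumes the default port 8000.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str = "http://localhost:8000") -> dict:
    """GET /health and decode the JSON body (standard library only)."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_healthy(
    fetch: Callable[[], dict] = fetch_health,
    attempts: int = 10,
    delay_s: float = 0.5,
) -> bool:
    """Poll until the daemon reports "healthy" or the attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch().get("status") == "healthy":
                return True
        except (OSError, ValueError):
            # Connection refused, timeout, or a non-JSON body: not ready yet.
            pass
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

Passing the fetch function as a parameter keeps the retry loop testable without a running server.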

GET /

API information and available endpoints.

GET /extract

Extract text from a PDF hosted at a URL.

Query Parameters:

url (required): Direct link to PDF file
output_file (optional): Custom output filename

Example:

GET /extract?url=https://example.com/document.pdf&output_file=custom_output.txt
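
The request above works unencoded because the PDF URL contains no `&` or `#`. If the target URL carries its own query string, it needs percent-encoding, otherwise its parameters would be read as parameters of `/extract`. A minimal sketch using the standard library; `build_extract_url` is a hypothetical helper name, not something this repository provides:

```python
from typing import Optional
from urllib.parse import urlencode


def build_extract_url(
    base_url: str, pdf_url: str, output_file: Optional[str] = None
) -> str:
    """Return a /extract request URL with percent-encoded query values."""
    params = {"url": pdf_url}
    if output_file is not None:
        params["output_file"] = output_file
    return f"{base_url.rstrip('/')}/extract?{urlencode(params)}"


# A PDF URL with its own query string stays a single `url` value.
print(build_extract_url("http://localhost:8000", "https://example.com/a.pdf?v=1&x=2"))
```

HTTP libraries such as `requests` perform the same encoding when the values are passed through `params`.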

Response:

{
    "success": true,
    "text": "Extracted text content...",
    "file_size_kb": 423.82,
    "pages": 5,
    "extraction_time_ms": 90.42,
    "message": "Successfully extracted 5 page(s)"
}
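
On the client side, the response can be loaded into a typed structure so that a missing field fails early. This is a sketch under the assumption that every field shown above is always present; `ExtractResult` is a hypothetical name and is not defined by the daemon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractResult:
    """Client-side view of a successful /extract response."""

    success: bool
    text: str
    file_size_kb: float
    pages: int
    extraction_time_ms: float
    message: str

    @classmethod
    def from_json(cls, payload: dict) -> "ExtractResult":
        """Build a result from the decoded response body.

        Raises KeyError if a documented field is missing.
        """
        return cls(
            success=bool(payload["success"]),
            text=payload["text"],
            file_size_kb=float(payload["file_size_kb"]),
            pages=int(payload["pages"]),
            extraction_time_ms=float(payload["extraction_time_ms"]),
            message=payload["message"],
        )
```

Usage: `result = ExtractResult.from_json(response.json())`, after which `result.pages` and `result.text` are available as attributes.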

Usage Examples

cURL examples:

Extract text from PDF URL:

curl "http://localhost:8000/extract?url=https://www.pdf995.com/samples/pdf.pdf"

Extract and save to custom file:

curl "http://localhost:8000/extract?url=https://www.pdf995.com/samples/pdf.pdf&output_file=my_output.txt"

Check health:

curl http://localhost:8000/health

Python example:

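import requests

# Ask the daemon to download the PDF and extract its text. The 90-second
# client timeout is an assumption, chosen to exceed the daemon's 60-second
# download timeout.
response = requests.get(
    "http://localhost:8000/extract",
    params={"url": "https://example.com/document.pdf"},
    timeout=90,
)
response.raise_for_status()  # Raise on 4xx/5xx instead of parsing an error body

data = response.json()
print(f"Extracted {data['pages']} pages in {data['extraction_time_ms']:.2f}ms")
print(data["text"])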

Performance

Using PyMuPDF (fitz), extraction takes roughly 90 to 260 ms for the sample files below:

File Size	Pages	Time
423 KB	5	~90ms
1.2 MB	~80	~260ms
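
To compare the two rows, the timings can be normalized per page. The snippet below only does arithmetic on the approximate figures from the table; it does not measure anything itself.

```python
# (file size label, pages, extraction time in ms), copied from the table above.
# The table marks the second page count and both timings as approximate.
MEASUREMENTS = [
    ("423 KB", 5, 90),
    ("1.2 MB", 80, 260),
]


def ms_per_page(pages: int, time_ms: float) -> float:
    """Average extraction time per page in milliseconds."""
    return time_ms / pages


for label, pages, time_ms in MEASUREMENTS:
    print(f"{label}: {ms_per_page(pages, time_ms):.2f} ms/page")
# prints:
# 423 KB: 18.00 ms/page
# 1.2 MB: 3.25 ms/page
```

The lower per-page figure for the larger file may point to a fixed per-request overhead, though two data points are not enough to confirm that.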

Error Handling

The API returns appropriate HTTP status codes:

400 - Invalid URL or request format
404 - PDF not found at URL
500 - Server error (download or extraction failed)

Error Response:

{
    "detail": "Failed to download PDF: 404"
}
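
A client can turn these responses into exceptions. The sketch below assumes the error body has the `detail` field shown above and falls back to the documented meaning of the status code when it does not; `ExtractionError` and `check_response` are hypothetical names.

```python
from typing import Any


class ExtractionError(Exception):
    """Raised when /extract responds with a non-2xx status."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(f"{status_code}: {detail}")
        self.status_code = status_code
        self.detail = detail


# Status codes and meanings as listed in this README.
STATUS_MEANINGS = {
    400: "Invalid URL or request format",
    404: "PDF not found at URL",
    500: "Server error (download or extraction failed)",
}


def check_response(status_code: int, body: Any) -> None:
    """Raise ExtractionError for error statuses; return None on success."""
    if 200 <= status_code < 300:
        return
    detail = body.get("detail") if isinstance(body, dict) else None
    if not isinstance(detail, str):
        # Missing or non-string detail: use the documented meaning instead.
        detail = STATUS_MEANINGS.get(status_code, "Unexpected error")
    raise ExtractionError(status_code, detail)
```

With `requests`, this could be called as `check_response(response.status_code, response.json())` before reading the `text` field.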

Notes

CLI Tool (`pdf_extractor.py`)

Saves output to same directory as source PDF by default (with .txt extension)
Use --output flag for custom output path
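
The default naming rule for local files can be pictured as follows. This is a sketch of the behavior described above, not the actual code in `pdf_extractor.py`; `resolve_output_path` is a hypothetical name, and how the tool names output for URL inputs is not covered here.

```python
from pathlib import Path
from typing import Optional


def resolve_output_path(pdf_path: str, output: Optional[str] = None) -> Path:
    """Return the --output path if given, else the PDF path with a .txt suffix."""
    if output is not None:
        return Path(output)
    # Same directory and base name as the source PDF, with the suffix swapped.
    return Path(pdf_path).with_suffix(".txt")


print(resolve_output_path("docs/report.pdf").as_posix())  # prints docs/report.txt
```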

API Daemon (`pdf_daemon.py`)

Extracted text is always returned in the JSON response
Optional output_file parameter saves a copy to /tmp/ on the server
Maximum download timeout: 60 seconds
Supports both HTTP and HTTPS URLs
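
Two of these notes affect client code: only HTTP and HTTPS URLs are accepted, and a request may take up to 60 seconds on the download alone. A minimal client-side sketch; `is_supported_pdf_url` is a hypothetical helper and the 15-second margin is an arbitrary assumption.

```python
from urllib.parse import urlsplit

# The daemon allows up to 60 s for the download, so the client waits longer.
# The 15 s margin for extraction and transfer is an assumption.
SERVER_DOWNLOAD_TIMEOUT_S = 60
CLIENT_TIMEOUT_S = SERVER_DOWNLOAD_TIMEOUT_S + 15


def is_supported_pdf_url(url: str) -> bool:
    """True if the URL uses the http or https scheme and names a host."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit rejects some malformed inputs, such as a broken IPv6 host.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

Checking the scheme before calling `/extract` avoids a round trip for inputs such as local paths or `ftp://` links.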