Features:
- CLI tool (pdf_extractor.py) for local files and URLs using PyMuPDF
- FastAPI daemon (pdf_daemon.py) with GET /extract endpoint
- Query parameter-based API for easier agent integration
- Comprehensive test suites included

Performance:
- ~40-60x faster than pdfplumber (~50ms average extraction time)
- Handles PDFs up to 36+ MB efficiently

Documentation:
- README.md with full API reference
- QUICKSTART.md for both CLI and daemon modes

| File |
|---|
| .gitignore |
| QUICKSTART.md |
| README.md |
| TEST_RESULTS.md |
| comprehensive_test.sh |
| pdf_daemon.py |
| pdf_extractor.py |
| requirements.txt |
| session |
| test_daemon.sh |
README.md
PDF Text Extraction Daemon
FastAPI-based service for extracting text from PDF files hosted on the internet.
Quick Start
Install dependencies
pip install -r requirements.txt
Run the daemon
# Using uvicorn (default port 8000)
uvicorn pdf_daemon:app --host 0.0.0.0 --port 8000
# Or run directly with custom port
python3 pdf_daemon.py --port 5006
API Endpoints
GET /health
Check if the service is running.
Response:
{
  "status": "healthy",
  "service": "PDF Text Extraction Daemon"
}
GET /
API information and available endpoints.
GET /extract
Extract text from a PDF hosted at URL.
Query Parameters:
- url (required): Direct link to the PDF file
- output_file (optional): Custom output filename
Example:
GET /extract?url=https://example.com/document.pdf&output_file=custom_output.txt
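Because the url parameter is itself a URL, any `?`, `&` or `#` characters it contains need to be percent-encoded, otherwise they would be read as part of the daemon's own query string. The following is a minimal sketch using only the standard library; `build_extract_url` and the sample PDF address are illustrative names, not part of the project.

```python
from urllib.parse import urlencode


def build_extract_url(base, pdf_url, output_file=None):
    """Build a GET /extract request URL with percent-encoded parameters.

    Hypothetical helper for illustration; it is not shipped with the daemon.
    """
    params = {"url": pdf_url}
    if output_file is not None:
        params["output_file"] = output_file
    # urlencode escapes reserved characters such as '?', '&' and '/'
    return f"{base.rstrip('/')}/extract?{urlencode(params)}"


if __name__ == "__main__":
    print(build_extract_url(
        "http://localhost:8000",
        "https://example.com/report.pdf?version=2&lang=en",  # made-up URL
        output_file="custom_output.txt",
    ))
```

Clients such as `requests` do the same encoding automatically when the values are passed through `params=`.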
Response:
{
  "success": true,
  "text": "Extracted text content...",
  "file_size_kb": 423.82,
  "pages": 5,
  "extraction_time_ms": 90.42,
  "message": "Successfully extracted 5 page(s)"
}
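On the client side, the response fields shown above can be loaded into a typed structure so that a missing field fails early. This is only a sketch: `ExtractResult` and `parse_extract_response` are hypothetical names, and the field list is taken from the sample response rather than from the daemon's source.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class ExtractResult:
    # Field names mirror the sample /extract response
    success: bool
    text: str
    file_size_kb: float
    pages: int
    extraction_time_ms: float
    message: str


def parse_extract_response(body):
    """Parse a JSON response body; raises KeyError if a field is missing."""
    data = json.loads(body)
    return ExtractResult(**{f.name: data[f.name] for f in fields(ExtractResult)})


if __name__ == "__main__":
    sample = (
        '{"success": true, "text": "Extracted text content...", '
        '"file_size_kb": 423.82, "pages": 5, "extraction_time_ms": 90.42, '
        '"message": "Successfully extracted 5 page(s)"}'
    )
    result = parse_extract_response(sample)
    print(result.pages, result.extraction_time_ms)
```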
Usage Examples
cURL examples:
Extract text from PDF URL:
curl "http://localhost:8000/extract?url=https://www.pdf995.com/samples/pdf.pdf"
Extract and save to custom file:
curl "http://localhost:8000/extract?url=https://www.pdf995.com/samples/pdf.pdf&output_file=my_output.txt"
Check health:
curl http://localhost:8000/health
Python example:
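import requests

# Ask the daemon to download the PDF and extract its text
response = requests.get(
    "http://localhost:8000/extract",
    params={
        "url": "https://example.com/document.pdf"
    },
    timeout=70,  # slightly above the daemon's 60-second download timeout
)
response.raise_for_status()  # raise on 400/404/500 instead of parsing an error body
data = response.json()
print(f"Extracted {data['pages']} pages in {data['extraction_time_ms']:.2f}ms")
print(data['text'])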
Performance
Using PyMuPDF (fitz), extraction of the sample documents below completes in well under a second:
| File Size | Pages | Time |
|---|---|---|
| 423 KB | 5 | ~90ms |
| 1.2 MB | ~80 | ~260ms |
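To compare the two rows, it can help to normalize them by file size and by page count. The sketch below does that arithmetic; it assumes that 1 MB means 1024 KB and treats the approximate table values as exact, so the results are rough estimates only.

```python
KB_PER_MB = 1024  # assumption: the table's "MB" means 1024 KB

# (label, size in KB, pages, time in ms), copied from the table above;
# every value is approximate
MEASUREMENTS = [
    ("423 KB", 423.0, 5, 90.0),
    ("1.2 MB", 1.2 * KB_PER_MB, 80, 260.0),
]


def kb_per_ms(size_kb, time_ms):
    """Throughput in kilobytes processed per millisecond."""
    return size_kb / time_ms


def ms_per_page(pages, time_ms):
    """Average time spent per page in milliseconds."""
    return time_ms / pages


if __name__ == "__main__":
    for label, size_kb, pages, time_ms in MEASUREMENTS:
        print(f"{label}: {kb_per_ms(size_kb, time_ms):.2f} KB/ms, "
              f"{ms_per_page(pages, time_ms):.2f} ms/page")
```

For these two rows the throughput comes out at roughly 4.7 KB/ms in both cases, while the per-page time differs considerably, so in this small sample the time tracks file size more closely than page count. Two measurements are too few to generalize from.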
Error Handling
The API returns appropriate HTTP status codes:
- 400: Invalid URL or request format
- 404: PDF not found at URL
- 500: Server error (download or extraction failed)
Error Response:
{
  "detail": "Failed to download PDF: 404"
}
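A client can combine the status code with the detail field of the error body to produce a readable message. The helper below is a sketch under the assumption that error bodies always follow the `{"detail": ...}` shape shown above; `describe_error` and `STATUS_MEANINGS` are hypothetical names.

```python
import json

# Meanings as documented in the status code list above
STATUS_MEANINGS = {
    400: "invalid URL or request format",
    404: "PDF not found at URL",
    500: "server error (download or extraction failed)",
}


def describe_error(status_code, body):
    """Turn a status code and raw response body into a one-line message."""
    try:
        detail = json.loads(body).get("detail", "")
    except (ValueError, TypeError, AttributeError):
        detail = ""  # body was not a JSON object; fall back to the code alone
    meaning = STATUS_MEANINGS.get(status_code, "unexpected status")
    message = f"HTTP {status_code} ({meaning})"
    return f"{message}: {detail}" if detail else message


if __name__ == "__main__":
    print(describe_error(500, '{"detail": "Failed to download PDF: 404"}'))
```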
Notes
CLI Tool (pdf_extractor.py)
- Saves output to the same directory as the source PDF by default (with a .txt extension)
- Use the --output flag for a custom output path
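The default naming rule for local files can be pictured as swapping the extension while keeping the directory. This is a sketch of that rule with `pathlib`, not the actual code in pdf_extractor.py; `default_output_path` is a hypothetical name, and how the tool names output for URL inputs is not covered here.

```python
from pathlib import Path


def default_output_path(pdf_path, output=None):
    """Return the explicit output path, or the PDF path with a .txt suffix.

    Illustrative only; the real CLI may resolve paths differently.
    """
    if output is not None:
        return Path(output)  # corresponds to passing --output
    return Path(pdf_path).with_suffix(".txt")


if __name__ == "__main__":
    print(default_output_path("/data/reports/annual.pdf"))
    print(default_output_path("/data/reports/annual.pdf", output="/tmp/out.txt"))
```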
API Daemon (pdf_daemon.py)
- Extracted text is always returned in the JSON response
- Optional output_file parameter saves a copy to /tmp/ on the server
- Maximum download timeout: 60 seconds
- Supports both HTTP and HTTPS URLs
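A client can apply two of these notes before sending a request: reject URLs that are not HTTP or HTTPS, and pick a client-side timeout somewhat longer than the 60-second download limit. The sketch below uses only the standard library; the function name and the 10-second margin are assumptions for illustration.

```python
from urllib.parse import urlsplit

SERVER_DOWNLOAD_TIMEOUT_S = 60  # from the daemon notes
CLIENT_TIMEOUT_S = SERVER_DOWNLOAD_TIMEOUT_S + 10  # assumed safety margin


def is_supported_pdf_url(url):
    """Return True if the URL uses http or https and names a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


if __name__ == "__main__":
    for candidate in (
        "https://example.com/document.pdf",
        "ftp://example.com/document.pdf",
        "example.com/document.pdf",
    ):
        print(candidate, is_supported_pdf_url(candidate))
```

Checking on the client only saves a round trip; the daemon still decides what it accepts.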