PDF Text Extraction - Quick Start Guide
Components
- PDF Extractor CLI (pdf_extractor.py) - Command-line tool for local files and URLs
- PDF Daemon API (pdf_daemon.py) - FastAPI service for programmatic access
Option 1: CLI Tool (Simple)
Usage
# Extract from a local file (saves the text next to the PDF with a .txt extension)
python3 pdf_extractor.py document.pdf
# With a custom output path
python3 pdf_extractor.py document.pdf --output result.txt
# From URL (downloads and extracts, saves to current directory)
python3 pdf_extractor.py https://example.com/doc.pdf
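To run the CLI over a whole folder of PDFs, the commands above can be wrapped in a small script. The following is a minimal sketch, not part of the tool itself. It assumes pdf_extractor.py sits in the current working directory and that it reports failure with a non-zero exit code (not confirmed by this guide). The helper names build_command and extract_directory are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_command(pdf_path, output_path=None, script="pdf_extractor.py"):
    """Build the argument list for one pdf_extractor.py invocation."""
    # sys.executable stands in for "python3" so the same interpreter is reused.
    cmd = [sys.executable, script, str(pdf_path)]
    if output_path is not None:
        # --output is the flag shown in the usage examples above.
        cmd += ["--output", str(output_path)]
    return cmd


def extract_directory(pdf_dir, out_dir):
    """Run the CLI once per PDF in pdf_dir and return the names that failed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        target = out_dir / (pdf.stem + ".txt")
        result = subprocess.run(build_command(pdf, target))
        # Assumption: a non-zero exit code means the extraction failed.
        if result.returncode != 0:
            failed.append(pdf.name)
    return failed
```

Calling extract_directory("pdfs", "extracted") would write one .txt file per PDF into extracted/ and return the list of files the CLI could not process. Both directory names are placeholders.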
Speed
- ~0.39 s for a 1.2 MB PDF (~80 pages)
- ~0.40 s for a 437 KB PDF (~15 pages)
Option 2: API Daemon (Service Mode)
Start the daemon
# Using uvicorn (default port 8000)
uvicorn pdf_daemon:app --host 0.0.0.0 --port 8000
# Or run directly on a custom port (the examples below assume port 8000, so adjust them to match)
python3 pdf_daemon.py --port 5006
API Endpoints
Health Check:
curl http://localhost:8000/health
Extract PDF from URL:
curl "http://localhost:8000/extract?url=https://example.com/doc.pdf"
With custom output file:
curl "http://localhost:8000/extract?url=https://example.com/doc.pdf&output_file=result.txt"
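The curl examples above pass the PDF address as a raw query parameter. That works for simple addresses, but a PDF URL that itself contains ? or & would be split into separate parameters. A minimal sketch of building a safely encoded request URL with the standard library follows. The helper name build_extract_url is hypothetical; the url and output_file parameter names are taken from the examples above.

```python
from urllib.parse import urlencode


def build_extract_url(pdf_url, output_file=None, base="http://localhost:8000"):
    """Return the /extract request URL with all query parameters percent-encoded."""
    params = {"url": pdf_url}
    if output_file is not None:
        params["output_file"] = output_file
    # urlencode escapes ':', '/', '?', '&' and '=' inside each value.
    return f"{base}/extract?{urlencode(params)}"


print(build_extract_url("https://example.com/doc.pdf?v=1&x=2", "result.txt"))
```

The printed URL can be passed to curl as is. The requests client does the same encoding automatically when the values are given through params.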
Python Client Example
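import requests

# Ask the daemon to download the PDF and extract its text
response = requests.get(
    "http://localhost:8000/extract",
    params={"url": "https://example.com/doc.pdf"},
    timeout=30,  # avoid hanging forever if the daemon is unreachable
)
response.raise_for_status()  # fail early on HTTP errors

data = response.json()
print(data['text'])
print(f"Extracted in {data['extraction_time_ms']:.2f}ms")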
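A script that starts the daemon and immediately calls /extract can race the server startup. One way around this is to poll the /health endpoint first. The sketch below uses only the standard library and assumes that a healthy daemon answers /health with HTTP 200; the response body is not inspected because this guide does not document it. The names is_healthy and wait_until_healthy are hypothetical.

```python
import time
import urllib.request


def is_healthy(base="http://localhost:8000", timeout=2.0):
    """Return True if GET /health answers with HTTP 200 (assumed success signal)."""
    try:
        with urllib.request.urlopen(f"{base}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error responses.
        return False


def wait_until_healthy(probe=is_healthy, attempts=10, delay=0.5):
    """Call probe up to `attempts` times, sleeping `delay` seconds after each failure."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```

With the defaults, wait_until_healthy() gives the daemon roughly five seconds to come up before returning False. The probe argument exists so that a different check can be swapped in.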
Performance Summary
| File | Size | Pages | Time |
|---|---|---|---|
| Academic dissertation | 1.2 MB | ~80 | ~390ms |
| Technical spec | 437 KB | ~15 | ~400ms |
| Sample PDF (API test) | 424 KB | 5 | ~80ms |
Total round-trip time, including download, is typically under 1 second.
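To check these figures on your own documents, the round-trip time seen by the client can be compared with the extraction_time_ms value the daemon reports. The sketch below is one way to do that. It assumes extraction_time_ms covers only the extraction step and not the download, which this guide does not state explicitly. The helper names timed and overhead_ms are hypothetical.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def overhead_ms(round_trip_ms, extraction_time_ms):
    """Time spent outside extraction (download, HTTP, JSON), never below zero."""
    return max(round_trip_ms - extraction_time_ms, 0.0)
```

For example, wrapping the client call as data, rt = timed(lambda: requests.get(...).json()) and then calling overhead_ms(rt, data["extraction_time_ms"]) gives a rough estimate of how much of the round trip is download and HTTP overhead.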
Installation
cd /home/nicolas/pdf_tool
pip install -r requirements.txt
Or manually:
pip install pymupdf fastapi uvicorn aiohttp pydantic
File Structure
/home/nicolas/pdf_tool/
├── pdf_extractor.py # CLI tool for local files and URLs
├── pdf_daemon.py # FastAPI service daemon
├── test_daemon.sh # Basic API test suite
├── comprehensive_test.sh # Full test suite with sample-files.com PDFs
├── requirements.txt # Python dependencies
├── README.md # Full API documentation
└── QUICKSTART.md # This quick start guide