# PDF Text Extraction - Quick Start Guide

## Components

1. **PDF Extractor CLI** (`pdf_extractor.py`) - Command-line tool for local files and URLs
2. **PDF Daemon API** (`pdf_daemon.py`) - FastAPI service for programmatic access

---

## Option 1: CLI Tool (Simple)

### Usage

```bash
# Extract from local file (auto-saves to same directory with .txt extension)
python3 pdf_extractor.py document.pdf

# With custom output path
python3 pdf_extractor.py document.pdf --output result.txt

# From URL (downloads and extracts, saves to current directory)
python3 pdf_extractor.py https://example.com/doc.pdf
```

### Speed

- ~0.39s for 1.2MB PDF (~80 pages)
- ~0.40s for 437KB PDF (~15 pages)

---

## Option 2: API Daemon (Service Mode)

### Start the daemon

```bash
# Using uvicorn (default port 8000)
uvicorn pdf_daemon:app --host 0.0.0.0 --port 8000

# Or run directly with custom port
python3 pdf_daemon.py --port 5006
```

### API Endpoints

**Health Check:**

```bash
curl http://localhost:8000/health
```

**Extract PDF from URL:**

```bash
curl "http://localhost:8000/extract?url=https://example.com/doc.pdf"
```

**With custom output file:**

```bash
curl "http://localhost:8000/extract?url=https://example.com/doc.pdf&output_file=result.txt"
```

### Python Client Example

```python
import requests

response = requests.get(
    "http://localhost:8000/extract",
    params={"url": "https://example.com/doc.pdf"}
)
data = response.json()
print(data['text'])
print(f"Extracted in {data['extraction_time_ms']:.2f}ms")
```

---

## Performance Summary

| File | Size | Pages | Time |
|------|------|-------|------|
| Academic dissertation | 1.2 MB | ~80 | **~390ms** |
| Technical spec | 437 KB | ~15 | **~400ms** |
| Sample PDF (API test) | 424 KB | 5 | **~80ms** |

Total round-trip time including download: typically <1 second for most PDFs.
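
The `curl` examples for `/extract` pass the PDF URL as a raw query parameter, which works for simple URLs but can break when the PDF URL itself contains `?` or `&`. The following is a minimal sketch of building the request URL with percent-encoding. The helper name `build_extract_url` is hypothetical and not part of `pdf_daemon.py`; only the `/extract` path and the `url` and `output_file` parameters come from this guide.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, pdf_url, output_file=None):
    """Return an /extract URL with percent-encoded query parameters.

    Hypothetical helper for illustration; it only assembles a string
    and does not contact the daemon.
    """
    params = {"url": pdf_url}
    if output_file is not None:
        params["output_file"] = output_file
    # urlencode escapes characters such as '&', '?' and '=' inside values,
    # so a PDF URL with its own query string stays a single parameter.
    return f"{base_url.rstrip('/')}/extract?{urlencode(params)}"


if __name__ == "__main__":
    print(
        build_extract_url(
            "http://localhost:8000",
            "https://example.com/doc.pdf?id=1&lang=en",
            output_file="result.txt",
        )
    )
```

The Python client example avoids this problem already, because `requests` encodes the `params` dictionary itself; the helper is mainly useful when constructing URLs for `curl` or for logging.
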
---

## Installation

```bash
cd /home/nicolas/pdf_tool
pip install -r requirements.txt
```

Or manually:

```bash
pip install pymupdf fastapi uvicorn aiohttp pydantic
```

---

## Files Structure

```
/home/nicolas/pdf_tool/
├── pdf_extractor.py          # CLI tool for local files and URLs
├── pdf_daemon.py             # FastAPI service daemon
├── test_daemon.sh            # Basic API test suite
├── comprehensive_test.sh     # Full test suite with sample-files.com PDFs
├── requirements.txt          # Python dependencies
├── README.md                 # Full API documentation
└── QUICKSTART.md             # This quick start guide
```
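
After running the install step, a quick check along these lines can confirm that the dependencies are importable before starting the CLI or the daemon. This is only a sketch: the function name `missing_packages` is hypothetical, and the import names are assumptions (in particular, PyMuPDF has historically been imported as `fitz`).

```python
from importlib.util import find_spec

# pip package name -> import name. The import names are assumptions;
# PyMuPDF has historically been imported as "fitz".
REQUIRED_MODULES = {
    "pymupdf": "fitz",
    "fastapi": "fastapi",
    "uvicorn": "uvicorn",
    "aiohttp": "aiohttp",
    "pydantic": "pydantic",
}


def missing_packages(modules=None):
    """Return the sorted pip names whose import module cannot be found."""
    if modules is None:
        modules = REQUIRED_MODULES
    # find_spec returns None when a top-level module is not installed.
    return sorted(pkg for pkg, mod in modules.items() if find_spec(mod) is None)


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```
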