# PDF Text Extraction - Quick Start Guide
## Components
1. **PDF Extractor CLI** (`pdf_extractor.py`) - Command-line tool for local files and URLs
2. **PDF Daemon API** (`pdf_daemon.py`) - FastAPI service for programmatic access
---
## Option 1: CLI Tool (Simple)
### Usage
```bash
# Extract from local file (auto-saves to same directory with .txt extension)
python3 pdf_extractor.py document.pdf
# With custom output path
python3 pdf_extractor.py document.pdf --output result.txt
# From URL (downloads and extracts, saves to current directory)
python3 pdf_extractor.py https://example.com/doc.pdf
```
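To run the CLI over a whole folder, the invocations above can be wrapped in a small driver. This is a minimal sketch, not part of the tool: `build_command` and `extract_all` are hypothetical helper names, the script is assumed to sit in the working directory, and a non-zero exit code is assumed to signal failure.

```python
import subprocess
import sys
from pathlib import Path


def build_command(pdf_path, output_path=None, script="pdf_extractor.py"):
    """Return the argv list for one extractor run, mirroring the usage above."""
    cmd = [sys.executable, script, str(pdf_path)]
    if output_path is not None:
        cmd += ["--output", str(output_path)]
    return cmd


def extract_all(folder, out_dir=None):
    """Run the extractor once per PDF in `folder`; return the PDFs that failed."""
    failed = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # Without out_dir, rely on the tool's default of writing next to the PDF.
        output = Path(out_dir) / f"{pdf.stem}.txt" if out_dir else None
        # Assumption: the CLI exits with a non-zero status when extraction fails.
        result = subprocess.run(build_command(pdf, output))
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

Calling `extract_all("papers")` would process every `*.pdf` in a `papers` folder and return the ones whose run did not exit cleanly.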
### Speed
- ~0.39 s for a 1.2 MB PDF (~80 pages)
- ~0.40 s for a 437 KB PDF (~15 pages)
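
The two timings can be turned into a rough pages-per-second figure. The sketch below only does that arithmetic on the numbers quoted above; the page counts are approximate, so the results are too. Both runs take about the same time despite very different page counts, which may point to a fixed per-run cost, though two samples are not enough to confirm that.

```python
# (label, pages, seconds) copied from the figures above; page counts are approximate.
runs = [
    ("1.2 MB PDF", 80, 0.39),
    ("437 KB PDF", 15, 0.40),
]


def pages_per_second(pages, seconds):
    """Throughput for a single extraction run."""
    return pages / seconds


for label, pages, seconds in runs:
    print(f"{label}: {pages_per_second(pages, seconds):.1f} pages/s")
# 1.2 MB PDF: 205.1 pages/s
# 437 KB PDF: 37.5 pages/s
```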
---
## Option 2: API Daemon (Service Mode)
### Start the daemon
```bash
# Using uvicorn (default port 8000)
uvicorn pdf_daemon:app --host 0.0.0.0 --port 8000
# Or run directly with custom port
python3 pdf_daemon.py --port 5006
```
### API Endpoints
**Health Check:**
```bash
curl http://localhost:8000/health
```
**Extract PDF from URL:**
```bash
curl "http://localhost:8000/extract?url=https://example.com/doc.pdf"
```
**With custom output file:**
```bash
curl "http://localhost:8000/extract?url=https://example.com/doc.pdf&output_file=result.txt"
```
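The curl examples pass the PDF URL as a raw query parameter. That works for simple URLs, but a PDF URL that carries its own `?` or `&` would be split by the query parser. A minimal sketch of building the request URL with percent-encoding, using only the standard library (`build_extract_url` is a hypothetical helper, and the base address assumes the default port shown above):

```python
from urllib.parse import urlencode


def build_extract_url(pdf_url, output_file=None, base="http://localhost:8000"):
    """Build an /extract request URL with the query string percent-encoded."""
    params = {"url": pdf_url}
    if output_file is not None:
        params["output_file"] = output_file
    return f"{base}/extract?{urlencode(params)}"


print(build_extract_url("https://example.com/doc.pdf?version=2&lang=en", "result.txt"))
# http://localhost:8000/extract?url=https%3A%2F%2Fexample.com%2Fdoc.pdf%3Fversion%3D2%26lang%3Den&output_file=result.txt
```

The resulting string can be passed to curl in quotes, so the nested `&lang=en` stays part of the `url` parameter.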
### Python Client Example
```python
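import requests

# Ask the daemon to download and extract the PDF at the given URL.
response = requests.get(
    "http://localhost:8000/extract",
    params={"url": "https://example.com/doc.pdf"},
    timeout=30,  # avoid hanging forever if the daemon or the download stalls
)
response.raise_for_status()  # surface HTTP errors before parsing the body
data = response.json()
print(data["text"])
print(f"Extracted in {data['extraction_time_ms']:.2f}ms")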
```
---
## Performance Summary
| File | Size | Pages | Time |
|------|------|-------|------|
| Academic dissertation | 1.2 MB | ~80 | **~390ms** |
| Technical spec | 437 KB | ~15 | **~400ms** |
| Sample PDF (API test) | 424 KB | 5 | **~80ms** |

Total round-trip time, including download, is typically under 1 second for most PDFs.

---
## Installation
```bash
cd /home/nicolas/pdf_tool
pip install -r requirements.txt
```
Or manually:
```bash
pip install pymupdf fastapi uvicorn aiohttp pydantic
```
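
After installing, a quick check that the packages are importable can save a failed first run. This is a sketch under one assumption: PyMuPDF is imported as `fitz`, which has been its usual module name but may differ by version. `missing_packages` is a hypothetical helper.

```python
from importlib.util import find_spec

# pip package name -> module name to look up.
# Assumption: PyMuPDF installs a module named "fitz".
REQUIRED = {
    "pymupdf": "fitz",
    "fastapi": "fastapi",
    "uvicorn": "uvicorn",
    "aiohttp": "aiohttp",
    "pydantic": "pydantic",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose module cannot be found on the import path."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


print(missing_packages() or "all dependencies importable")
```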
---
## File Structure
```
/home/nicolas/pdf_tool/
├── pdf_extractor.py # CLI tool for local files and URLs
├── pdf_daemon.py # FastAPI service daemon
├── test_daemon.sh # Basic API test suite
├── comprehensive_test.sh # Full test suite with sample-files.com PDFs
├── requirements.txt # Python dependencies
├── README.md # Full API documentation
└── QUICKSTART.md # This quick start guide
```