# PDF Text Extraction - Quick Start Guide

## Components

1. **PDF Extractor CLI** (`pdf_extractor.py`) - Command-line tool for local files and URLs
2. **PDF Daemon API** (`pdf_daemon.py`) - FastAPI service for programmatic access

---

## Option 1: CLI Tool (Simple)

### Usage

```bash
# Extract from local file (auto-saves to same directory with .txt extension)
python3 pdf_extractor.py document.pdf

# With custom output path
python3 pdf_extractor.py document.pdf --output result.txt

# From URL (downloads and extracts, saves to current directory)
python3 pdf_extractor.py https://example.com/doc.pdf
```
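
The comments above describe two default naming rules: a local file is saved next to its source with a `.txt` extension, and a URL download is saved in the current directory. A minimal sketch of that rule, assuming the output name is derived from the input's file name. The helper `default_output_path` and the fallback name `download.pdf` are hypothetical and not taken from `pdf_extractor.py`:

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse


def default_output_path(source: str) -> Path:
    """Guess where the extracted text would land for a given input (assumed rule)."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # URL input: use the last path segment and save into the current directory.
        # "download.pdf" is a made-up fallback for URLs without a file name.
        name = PurePosixPath(unquote(parsed.path)).name or "download.pdf"
        return Path.cwd() / Path(name).with_suffix(".txt")
    # Local input: keep the directory, swap the extension for .txt.
    return Path(source).with_suffix(".txt")


print(default_output_path("reports/document.pdf"))
print(default_output_path("https://example.com/doc.pdf"))
```

If the tool's actual naming differs from this sketch, passing `--output` sets the path explicitly and avoids the question.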

### Speed

- ~0.39s for 1.2 MB PDF (~80 pages)
- ~0.40s for 437 KB PDF (~15 pages)

---

## Option 2: API Daemon (Service Mode)

### Start the daemon

```bash
# Using uvicorn (default port 8000)
uvicorn pdf_daemon:app --host 0.0.0.0 --port 8000

# Or run directly with custom port
python3 pdf_daemon.py --port 5006
```

### API Endpoints

**Health Check:**
```bash
curl http://localhost:8000/health
```
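
When the daemon is started from a script, the first request can arrive before the server is listening. A minimal sketch of polling the health endpoint until it answers, assuming that any HTTP 200 response means "ready" (the response body format is not specified in this guide). The helper names `is_healthy` and `wait_until_healthy` are hypothetical:

```python
import time
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200 (assumed meaning: ready)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_healthy(url="http://localhost:8000/health",
                       attempts=10, delay=0.5, probe=is_healthy) -> bool:
    """Call the probe up to `attempts` times, pausing `delay` seconds after each failure."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False
```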

**Extract PDF from URL:**
```bash
curl "http://localhost:8000/extract?url=https://example.com/doc.pdf"
```

**With custom output file:**
```bash
curl "http://localhost:8000/extract?url=https://example.com/doc.pdf&output_file=result.txt"
```
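
### Python Client Example

```python
# Requires the requests package (pip install requests)
import requests

# Ask the daemon to download the PDF and extract its text
response = requests.get(
    "http://localhost:8000/extract",
    params={"url": "https://example.com/doc.pdf"},
    timeout=30,  # seconds; avoids hanging if the daemon is unreachable
)
response.raise_for_status()  # stop on HTTP errors instead of parsing an error body

data = response.json()
print(data['text'])
print(f"Extracted in {data['extraction_time_ms']:.2f}ms")
```

---
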
## Performance Summary

| File | Size | Pages | Time |
|------|------|-------|------|
| Academic dissertation | 1.2 MB | ~80 | **~390ms** |
| Technical spec | 437 KB | ~15 | **~400ms** |
| Sample PDF (API test) | 424 KB | 5 | **~80ms** |

Total round-trip time, including download, is typically under 1 second for most PDFs.

---

## Installation

```bash
cd /home/nicolas/pdf_tool
pip install -r requirements.txt
```

Or manually:
```bash
pip install pymupdf fastapi uvicorn aiohttp pydantic
```
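
After installing, a quick check that the packages are importable can save a confusing traceback later. The sketch below is not part of the tool. It assumes the import names listed in `ASSUMED_IMPORT_NAMES`, in particular that PyMuPDF is importable as `fitz` or, in recent releases, as `pymupdf`. The helper `missing_packages` is hypothetical:

```python
from importlib.util import find_spec

# Import names assumed for each pip package from the install command above.
# PyMuPDF has historically been imported as `fitz`; recent releases also
# accept `pymupdf`, so either one counts as installed here.
ASSUMED_IMPORT_NAMES = {
    "pymupdf": ["pymupdf", "fitz"],
    "fastapi": ["fastapi"],
    "uvicorn": ["uvicorn"],
    "aiohttp": ["aiohttp"],
    "pydantic": ["pydantic"],
}


def missing_packages(import_names=ASSUMED_IMPORT_NAMES):
    """Return the packages for which none of the candidate modules can be found."""
    return [
        package
        for package, modules in import_names.items()
        if not any(find_spec(module) is not None for module in modules)
    ]


if __name__ == "__main__":
    missing = missing_packages()
    print("All dependencies found" if not missing else f"Missing: {', '.join(missing)}")
```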

---

## File Structure

```
/home/nicolas/pdf_tool/
├── pdf_extractor.py          # CLI tool for local files and URLs
├── pdf_daemon.py             # FastAPI service daemon
├── test_daemon.sh            # Basic API test suite
├── comprehensive_test.sh     # Full test suite with sample-files.com PDFs
├── requirements.txt          # Python dependencies
├── README.md                 # Full API documentation
└── QUICKSTART.md             # This quick start guide
```