PDF Text Extraction - Quick Start Guide

Components

  1. PDF Extractor CLI (pdf_extractor.py) - Command-line tool for local files and URLs
  2. PDF Daemon API (pdf_daemon.py) - FastAPI service for programmatic access

Option 1: CLI Tool (Simple)

Usage

# Extract from local file (auto-saves to same directory with .txt extension)
python3 pdf_extractor.py document.pdf

# With custom output path
python3 pdf_extractor.py document.pdf --output result.txt

# From URL (downloads and extracts, saves to current directory)
python3 pdf_extractor.py https://example.com/doc.pdf
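
To process a whole directory, the CLI can be driven from a short script. The following is a minimal sketch, not part of the tool: `build_command` and `extract_all` are hypothetical helper names, and the code assumes it runs from the directory containing `pdf_extractor.py` and that the CLI exits with a non-zero status on failure.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, output_path=None):
    """Build the argument list for one pdf_extractor.py run."""
    cmd = ["python3", "pdf_extractor.py", str(pdf_path)]
    if output_path is not None:
        # --output is the flag shown in the usage examples above.
        cmd += ["--output", str(output_path)]
    return cmd


def extract_all(pdf_dir, out_dir):
    """Run the CLI once per *.pdf in pdf_dir; return the PDFs that failed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        target = out_dir / (pdf.stem + ".txt")
        result = subprocess.run(build_command(pdf, target))
        # Assumption: the CLI signals failure with a non-zero exit code.
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

For example, `extract_all("papers", "papers_txt")` would write one `.txt` file per PDF into `papers_txt` and return the list of inputs that could not be processed.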

Speed

  • ~0.39 s for a 1.2 MB PDF (~80 pages)
  • ~0.40 s for a 437 KB PDF (~15 pages)

Option 2: API Daemon (Service Mode)

Start the daemon

# Using uvicorn (default port 8000)
uvicorn pdf_daemon:app --host 0.0.0.0 --port 8000

# Or run directly with a custom port (then use that port instead of 8000 in the examples below)
python3 pdf_daemon.py --port 5006
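
When the daemon is started from a script, it can help to wait until it accepts requests before sending work. The sketch below polls the `/health` endpoint using only the standard library. `wait_for_health` is a hypothetical helper, and it assumes that a healthy daemon answers `/health` with HTTP 200; this guide does not specify the response format.

```python
import time
import urllib.request


def wait_for_health(base_url, attempts=10, delay=0.5, timeout=2.0):
    """Poll GET {base_url}/health until it returns HTTP 200 or attempts run out."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
                # Assumption: HTTP 200 means the daemon is ready.
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, timeout, or an HTTP error status: not ready yet.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A caller might use `wait_for_health("http://localhost:8000")` and abort if it returns `False`.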

API Endpoints

Health Check:

curl http://localhost:8000/health

Extract PDF from URL:

curl "http://localhost:8000/extract?url=https://example.com/doc.pdf"

With custom output file:

curl "http://localhost:8000/extract?url=https://example.com/doc.pdf&output_file=result.txt"
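
The curl examples pass the PDF address unencoded, which works for simple URLs. If the PDF URL itself contains `?` or `&`, it needs to be percent-encoded so that its parts are not read as separate query parameters of `/extract`. A minimal sketch of building such a request URL (`build_extract_url` is a hypothetical helper; the `url` and `output_file` parameter names are the ones shown above):

```python
from urllib.parse import urlencode


def build_extract_url(base_url, pdf_url, output_file=None):
    """Return the /extract request URL with all query values percent-encoded."""
    params = {"url": pdf_url}
    if output_file is not None:
        params["output_file"] = output_file
    # urlencode escapes characters such as ':', '/', '?' and '&' in the values.
    return f"{base_url}/extract?{urlencode(params)}"


print(build_extract_url("http://localhost:8000",
                        "https://example.com/doc.pdf?id=1&v=2"))
```

Clients that pass a `params` dictionary to `requests.get` get this encoding automatically.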

Python Client Example

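import requests

# A timeout keeps the client from hanging if the daemon or the download stalls.
response = requests.get(
    "http://localhost:8000/extract",
    params={"url": "https://example.com/doc.pdf"},
    timeout=30,
)
# Raise an exception on 4xx/5xx responses instead of parsing an error body.
response.raise_for_status()

data = response.json()
print(data['text'])
print(f"Extracted in {data['extraction_time_ms']:.2f}ms")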

Performance Summary

| File                  | Size   | Pages | Time    |
|-----------------------|--------|-------|---------|
| Academic dissertation | 1.2 MB | ~80   | ~390 ms |
| Technical spec        | 437 KB | ~15   | ~400 ms |
| Sample PDF (API test) | 424 KB | 5     | ~80 ms  |

Total round-trip time, including the download, is typically under 1 second for most PDFs.
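
This guide does not say how the timings were measured. One way to take comparable measurements on your own files is sketched below. `time_ms` and `extract_text` are hypothetical helpers, and the extraction step is an assumption based on the PyMuPDF dependency; the tool's actual code may differ.

```python
import time


def time_ms(fn, *args, repeats=3):
    """Call fn(*args) `repeats` times; return (last result, best time in ms)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return result, best


def extract_text(path):
    """Assumed extraction step using PyMuPDF; not taken from pdf_extractor.py."""
    import fitz  # PyMuPDF, installed via `pip install pymupdf`
    with fitz.open(path) as doc:
        return "".join(page.get_text() for page in doc)
```

With a local file available, `text, ms = time_ms(extract_text, "document.pdf")` gives the extraction time without download overhead.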


Installation

cd /home/nicolas/pdf_tool
pip install -r requirements.txt

Or manually:

pip install pymupdf fastapi uvicorn aiohttp pydantic
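
After installing, a quick check can confirm that the dependencies are importable. This is a minimal sketch; `missing_modules` is a hypothetical helper, and the list uses import names, which can differ from package names (PyMuPDF has historically been imported as `fitz`).

```python
from importlib.util import find_spec

# Import names for the packages listed above.
# Note: the pymupdf package is imported as "fitz".
REQUIRED = ["fitz", "fastapi", "uvicorn", "aiohttp", "pydantic"]


def missing_modules(names=REQUIRED):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("All dependencies found" if not missing else f"Missing: {missing}")
```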

File Structure

/home/nicolas/pdf_tool/
├── pdf_extractor.py      # CLI tool for local files and URLs
├── pdf_daemon.py         # FastAPI service daemon
├── test_daemon.sh        # Basic API test suite
├── comprehensive_test.sh # Full test suite with sample-files.com PDFs
├── requirements.txt      # Python dependencies
├── README.md             # Full API documentation
└── QUICKSTART.md         # This quick start guide