PDF Text Extraction Test Results

Test Environment

  • Service: FastAPI Daemon (uvicorn)
  • Extraction Engine: PyMuPDF (fitz)
  • Server: localhost:8000

Comprehensive Test Results

1. Basic Text Document ✓ PASS

  • File: basic-text.pdf
  • Size: 72.9 KB
  • Pages: 1
  • Extraction Time: 7.43ms
  • Round-trip Time: 1,878ms (including download)
  • Content Quality: ✓ Excellent - preserves formatting, lists, bold/italic text

2. Image-Heavy Document ✓ PASS

  • File: image-doc.pdf
  • Size: 7.97 MB
  • Pages: 6
  • Extraction Time: 43.73ms
  • Round-trip Time: 4,454ms (including download)
  • Content Quality: ✓ Excellent - text extracted correctly despite images

3. Fillable Form ✓ PASS

  • File: fillable-form.pdf
  • Size: 52.7 KB
  • Pages: 2
  • Extraction Time: 11.23ms
  • Round-trip Time: 1,864ms (including download)
  • Content Quality: ✓ Excellent - form fields and labels extracted

4. Developer Example ✓ PASS

  • File: dev-example.pdf
  • Size: 690 KB
  • Pages: 6
  • Extraction Time: 75.1ms
  • Round-trip Time: 3,091ms (including download)
  • Content Quality: ✓ Excellent - various PDF features handled

5. Multi-Page Report ✓ PASS

  • File: sample-report.pdf
  • Size: 2.39 MB
  • Pages: 10
  • Extraction Time: 130.19ms
  • Round-trip Time: ~4,000ms (including download)
  • Content Quality: ✓ Excellent - tables and complex layouts

6. Large Document (100 pages) ✓ PASS

  • File: large-doc.pdf
  • Size: 36.8 MB
  • Pages: 100
  • Extraction Time: 89.82ms
  • Round-trip Time: ~5,000ms (including download)
  • Content Quality: ✓ Excellent - all pages extracted successfully

7. Small Files (Various Sizes) ✓ PASS

| File | Pages | Extraction Time |
|------|-------|-----------------|
| sample-pdf-a4-size-65kb.pdf | 5 | 17.49ms |
| sample-text-only-pdf-a4-size.pdf | 5 | 23.62ms |
| sample-5-page-pdf-a4-size.pdf | 5 | 21.05ms |

Error Handling Tests

Invalid URL Format ✓ PASS

  • Test: URL without an http:// or https:// scheme
  • Result: Correctly rejected with error message
  • Error Message: "URL must start with http:// or https://"
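The scheme check behind this error can be sketched as follows. This is a minimal illustration, not the daemon's actual code: the function name `validate_pdf_url` is hypothetical, and only the error message is taken from the test above.

```python
def validate_pdf_url(url: str) -> str:
    """Return the URL unchanged if it has an http(s) scheme, else raise ValueError.

    Hypothetical helper: only the error text comes from the test results.
    """
    if not url.lower().startswith(("http://", "https://")):
        raise ValueError("URL must start with http:// or https://")
    return url


if __name__ == "__main__":
    for candidate in ("https://example.com/a.pdf", "example.com/a.pdf"):
        try:
            validate_pdf_url(candidate)
            print(f"accepted: {candidate}")
        except ValueError as exc:
            print(f"rejected: {candidate} ({exc})")
```

Rejecting the URL before any network call keeps this failure cheap: no connection is opened for input that can never succeed.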

Non-existent PDF ✓ PASS

  • Test: URL to non-existent file
  • Result: Returns 404 error
  • Error Message: "Failed to download PDF: 404"
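One way to turn an HTTP failure into this message is sketched below. It assumes the standard-library `urllib` client, which may not be what the daemon uses; `download_pdf` and its `opener` parameter are hypothetical names, and only the message format comes from the test above.

```python
import urllib.error
import urllib.request


def download_pdf(url: str, opener=urllib.request.urlopen, timeout: float = 30.0) -> bytes:
    """Fetch a PDF and return its bytes, or raise RuntimeError on an HTTP error.

    `opener` is injectable so the function can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as exc:
        # exc.code holds the HTTP status, e.g. 404 for a missing file.
        raise RuntimeError(f"Failed to download PDF: {exc.code}") from exc
```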

Password Protected PDF ✓ PASS (Graceful Failure)

  • File: protected.pdf
  • Expected Behavior: Should fail gracefully
  • Result: Extraction failed with clear message
  • Error Message: "Extraction failed: document closed or encrypted"
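A graceful failure of this kind could be produced by checking for encryption before reading any pages. The sketch below assumes PyMuPDF's `is_encrypted` attribute and `get_text()` method. The daemon's real handling is not shown in this report and may simply surface the library's own exception, so `extract_text` is a hypothetical helper.

```python
def extract_text(doc) -> str:
    """Extract plain text from an opened PyMuPDF document, page by page.

    `doc` is expected to behave like a fitz.Document, for example the result
    of fitz.open(stream=pdf_bytes, filetype="pdf").
    """
    # is_encrypted stays true while an encrypted document has not been authenticated.
    if doc.is_encrypted:
        raise ValueError("Extraction failed: document closed or encrypted")
    # Iterating a document yields its pages; get_text() returns plain text.
    return "\n".join(page.get_text() for page in doc)
```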

Output File Test ✓ PASS

  • Test: Custom output file parameter
  • Result: File created successfully at /tmp/test_output.txt
  • File Size: 916 bytes (basic-text.pdf)

Performance Summary

Extraction Speed by File Size

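| Category | Size Range | Pages | Avg Extraction Time | Total Round-Trip |
|----------|------------|-------|---------------------|------------------|
| Small | <100 KB | 1-5 | ~15ms | ~2,000ms |
| Medium | 100 KB - 3 MB | 6-10 | ~70ms | ~3,500ms |
| Large | >3 MB | 10+ | ~80ms | ~4,500ms |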

Key Performance Metrics

  • Fastest Extraction: Basic text (7.43ms)
  • Slowest Extraction: Multi-page report (130.19ms)
  • Largest File Handled: 36.8 MB (100 pages) in ~90ms
  • Average Extraction Time: ~50ms
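These figures can be rechecked from the per-test extraction times reported above. The short script below is only a verification aid; the dictionary is copied from this report and nothing in it comes from the daemon itself.

```python
from statistics import mean

# Extraction times in milliseconds, copied from the per-test results above.
extraction_ms = {
    "basic-text.pdf": 7.43,
    "image-doc.pdf": 43.73,
    "fillable-form.pdf": 11.23,
    "dev-example.pdf": 75.1,
    "sample-report.pdf": 130.19,
    "large-doc.pdf": 89.82,
    "sample-pdf-a4-size-65kb.pdf": 17.49,
    "sample-text-only-pdf-a4-size.pdf": 23.62,
    "sample-5-page-pdf-a4-size.pdf": 21.05,
}

fastest = min(extraction_ms, key=extraction_ms.get)
slowest = max(extraction_ms, key=extraction_ms.get)
average = mean(extraction_ms.values())

print(f"fastest: {fastest} ({extraction_ms[fastest]}ms)")
print(f"slowest: {slowest} ({extraction_ms[slowest]}ms)")
print(f"average: {average:.1f}ms")
```

Across all nine files the mean works out to about 46.6ms, which is consistent with the rounded ~50ms figure.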

Round-Trip Times Include:

  1. HTTP connection establishment
  2. PDF download from remote server
  3. Text extraction via PyMuPDF
  4. JSON serialization and response
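To see how much of a round trip each step accounts for, the steps can be timed separately. The sketch below uses `time.perf_counter` with stand-in functions: `fake_download` and `fake_extract` are hypothetical placeholders that only sleep, since the daemon's own download and extraction functions are not shown here.

```python
import time
from typing import Callable, Tuple


def timed(fn: Callable[[], object]) -> Tuple[object, float]:
    """Run fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000.0


# Stand-ins for the real steps (hypothetical; they only simulate work).
def fake_download() -> bytes:
    time.sleep(0.05)  # pretend the network takes about 50ms
    return b"%PDF-1.7 ..."


def fake_extract(data: bytes) -> str:
    time.sleep(0.005)  # pretend extraction takes about 5ms
    return "extracted text"


pdf_bytes, download_ms = timed(fake_download)
text, extract_ms = timed(lambda: fake_extract(pdf_bytes))
print(f"download: {download_ms:.1f}ms, extraction: {extract_ms:.1f}ms")
```

Reporting the two numbers separately, as the results above do with Extraction Time and Round-trip Time, makes it clear whether a slow request is caused by the network or by the PDF itself.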

Content Quality Assessment

Preserved Elements ✓

  • Paragraph structure
  • Lists (ordered and unordered)
  • Form labels and fields
  • Headers and titles
  • Basic text formatting hints

Expected Limitations

  • Complex table layouts may lose some alignment
  • Images are not extracted (text-only mode)
  • Password-protected PDFs cannot be processed without the password

Test Summary

| Category | Tests Run | Passed | Failed |
|----------|-----------|--------|--------|
| Basic Functionality | 6 | 6 | 0 |
| Error Handling | 3 | 3 | 0 |
| Output File | 1 | 1 | 0 |
| Total | 10 | 10 | 0 |

✓ ALL TESTS PASSED!


Recommendations

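  1. Production Use: The daemon extracted every test PDF successfully, covering text-only, image-heavy, form, multi-page, and 100-page documents
  2. Large Files: The largest file tested (36.8 MB, 100 pages) was extracted in ~90ms
  3. Error Handling: Invalid URLs, missing files, and encrypted PDFs fail gracefully with clear error messages
  4. Performance: Extraction took under 100ms in every test except the multi-page report (130.19ms); most of the round-trip time is spent outside extraction
  5. Limitations: Password-protected PDFs are not extracted and require manual handling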

Sample API Response (Success)

{
    "success": true,
    "text": "Sample Document for PDF Testing\nIntroduction...",
    "file_size_kb": 72.91,
    "pages": 1,
    "extraction_time_ms": 7.43,
    "message": "Successfully extracted 1 page(s)"
}

Sample API Response (Error)

{
    "detail": "Extraction failed: document closed or encrypted"
}
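A client can tell the two response shapes apart by the presence of the `detail` field. The sketch below parses both sample bodies shown above; `summarize` is a hypothetical helper, and it assumes that every error response carries `detail` while every success response carries `pages` and `extraction_time_ms`, as in these samples.

```python
import json

# Response bodies copied from the samples above.
SUCCESS_BODY = r'''{
    "success": true,
    "text": "Sample Document for PDF Testing\nIntroduction...",
    "file_size_kb": 72.91,
    "pages": 1,
    "extraction_time_ms": 7.43,
    "message": "Successfully extracted 1 page(s)"
}'''
ERROR_BODY = '{"detail": "Extraction failed: document closed or encrypted"}'


def summarize(body: str) -> str:
    """Return a one-line summary of a daemon response body."""
    payload = json.loads(body)
    # In the samples, error responses carry only a "detail" field.
    if "detail" in payload:
        return f"error: {payload['detail']}"
    return f"ok: {payload['pages']} page(s) in {payload['extraction_time_ms']}ms"


print(summarize(SUCCESS_BODY))  # ok: 1 page(s) in 7.43ms
print(summarize(ERROR_BODY))    # error: Extraction failed: document closed or encrypted
```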