paper_firehose.commands.pqa_summary

Paper-QA Summarizer

This command selects entries from papers.db for a given topic whose rank_score meets or exceeds the configured threshold, then downloads the corresponding arXiv PDFs. Downloads adhere to the arXiv API Terms of Use: polite rate limiting and a descriptive User-Agent that includes a contact email.

Workflow

  1. Load configuration and identify ranked entries above threshold

  2. Download arXiv PDFs (respecting rate limits, reusing archived copies)

  3. Run paper-qa over each PDF to produce grounded JSON summaries

  4. Write summaries into papers.db and matched_entries_history.db
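Step 1 of the workflow can be sketched as a simple SQL selection. This is an illustrative sketch, not the actual implementation: the table name, column names, and function name are assumptions for demonstration.

```python
import sqlite3

def select_ranked_entries(db_path: str, topic: str, threshold: float):
    """Return (id, title, link) rows for a topic at or above the rank
    threshold, best-ranked first. (Schema is illustrative.)"""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT id, title, link FROM papers "
            "WHERE topic = ? AND rank_score >= ? "
            "ORDER BY rank_score DESC",
            (topic, threshold),
        ).fetchall()
    finally:
        con.close()
    return rows
```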

Architecture: PaperQASession

The paper-qa library uses a persistent index stored in ~/.pqa/ (or wherever PQA_HOME points). This creates a critical issue when processing multiple PDFs:

The Problem:

Paper-qa reads and caches PQA_HOME at import time. Python’s import system caches modules, so subsequent import paperqa statements return the cached module without re-reading environment variables. This means:

  1. First PDF: Set PQA_HOME=/tmp/A, import paperqa → paperqa caches /tmp/A

  2. Process PDF successfully, clean up /tmp/A

  3. Second PDF: Set PQA_HOME=/tmp/B, import paperqa → NO-OP, module cached!

  4. Paperqa still uses /tmp/A (now deleted) → “no papers found” error
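The failure mode above is ordinary Python module caching, and can be reproduced without paper-qa at all. The sketch below uses a throwaway stand-in module (fake_pqa, not the real paperqa) that reads an environment variable at import time:

```python
import os
import sys
import tempfile
import textwrap

# A stand-in module that, like paper-qa, reads PQA_HOME once at import time.
module_src = textwrap.dedent("""
    import os
    HOME = os.environ.get("PQA_HOME")  # read exactly once, at import
""")

tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "fake_pqa.py"), "w") as f:
    f.write(module_src)
sys.path.insert(0, tmp)

os.environ["PQA_HOME"] = "/tmp/A"
import fake_pqa
print(fake_pqa.HOME)   # /tmp/A

os.environ["PQA_HOME"] = "/tmp/B"
import fake_pqa        # no-op: sys.modules already holds the module
print(fake_pqa.HOME)   # still /tmp/A
```

The second import statement returns the cached module object from sys.modules, so the new environment value is never seen.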

The Solution: PaperQASession

We use a context manager that:

  1. Creates ONE temp directory for the entire summarization session

  2. Sets PQA_HOME and changes working directory BEFORE importing paperqa

  3. Imports paperqa ONCE (it correctly caches the session directory)

  4. Processes each PDF in isolated paper/index directories

  5. Cleans up the session directory on exit

This ensures paperqa always sees a valid, consistent environment throughout the entire run, regardless of how many PDFs are processed.
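The context-manager pattern can be sketched as follows. This is a simplified illustration of the session lifecycle, not the actual PaperQASession implementation; the class name and prefix are illustrative.

```python
import os
import shutil
import tempfile

class PaperQASessionSketch:
    """Illustrative sketch of the session pattern: one temp directory,
    environment configured BEFORE the single import, everything restored
    on exit (even on exceptions)."""

    def __enter__(self):
        self.original_cwd = os.getcwd()
        self.original_pqa_home = os.environ.get("PQA_HOME")
        self.temp_dir = tempfile.mkdtemp(prefix="paperqa_session_")
        os.environ["PQA_HOME"] = self.temp_dir  # must precede the import
        os.chdir(self.temp_dir)
        # import paperqa  # imported exactly once; caches self.temp_dir
        return self

    def __exit__(self, exc_type, exc, tb):
        os.chdir(self.original_cwd)
        if self.original_pqa_home is None:
            os.environ.pop("PQA_HOME", None)
        else:
            os.environ["PQA_HOME"] = self.original_pqa_home
        shutil.rmtree(self.temp_dir, ignore_errors=True)
        return False  # never swallow exceptions
```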

Why it works locally but fails on VPS:

  • Local: Fresh Python process per test → import cache empty → works

  • VPS (pipx): Long-running process or module caching → import cache populated from first PDF → subsequent PDFs fail

Usage Example

with PaperQASession(llm='gpt-4o', summary_llm='gpt-4o-mini') as session:
    for pdf_path in pdf_paths:
        answer = session.summarize_pdf(pdf_path, "Summarize this paper")
        if answer:
            print(answer)

Functions

run(config_path[, topic, rps, limit, arxiv, ...])

Execute the paper-qa download + summarization workflow.

Classes

PaperQASession([llm, summary_llm])

Context manager for processing multiple PDFs with paper-qa.

class paper_firehose.commands.pqa_summary.PaperQASession(llm=None, summary_llm=None)[source]

Bases: object

Context manager for processing multiple PDFs with paper-qa.

This class solves a critical issue with paper-qa’s environment handling: paper-qa reads PQA_HOME at import time and caches it internally. Python’s import system caches modules, so setting PQA_HOME before subsequent imports has no effect - the module is already loaded with the old value.

Solution Architecture

Instead of creating a new temp directory for each PDF (which fails after the first PDF because paperqa is already imported with the old path), we create ONE session directory and process all PDFs within it:

Session Start
├── Create /tmp/paperqa_session_xxx/
├── Set PQA_HOME=/tmp/paperqa_session_xxx
├── Change CWD to /tmp/paperqa_session_xxx
└── Import paperqa (caches /tmp/paperqa_session_xxx) ✓

For each PDF:
├── Create /tmp/paperqa_session_xxx/paper_*/ and index_*/ dirs
├── Copy PDF into paper_*/ and run ask() with those dirs
└── Remove per-PDF dirs to keep runs isolated

Session End
├── Restore original CWD
├── Restore original PQA_HOME
└── Delete /tmp/paperqa_session_xxx/

Key Design Decisions

  1. Single import: paperqa is imported exactly once per session, so it correctly caches the session’s temp directory.

  2. Per-PDF isolation: Each PDF gets its own paper/index directories, so there is no cross-contamination or stale index state between runs.

  3. PDF removal: We remove each PDF after processing to ensure paper-qa only sees one PDF at a time during indexing.

  4. Environment restoration: We carefully restore PQA_HOME and CWD on exit, even if an exception occurs.
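Decision 4 boils down to a try/finally around the work. A minimal sketch of that restoration guarantee, with an illustrative helper name:

```python
import os

def with_env(var: str, value: str, fn):
    """Run fn() with `var` temporarily set to `value`, restoring the
    previous value (or unsetting it) even if fn raises."""
    old = os.environ.get(var)
    os.environ[var] = value
    try:
        return fn()
    finally:
        if old is None:
            os.environ.pop(var, None)
        else:
            os.environ[var] = old
```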

Attributes

llm : str or None

The LLM model to use for paper-qa (e.g., ‘gpt-4o’, ‘gpt-5.2’).

summary_llm : str or None

The summary LLM model (e.g., ‘gpt-4o-mini’).

temp_dir : str or None

Path to the session’s temporary directory (set on __enter__).

original_cwd : str or None

The working directory before session start (for restoration).

original_pqa_home : str or None

The PQA_HOME value before session start (for restoration).

Example

pdf_paths = ['/path/to/paper1.pdf', '/path/to/paper2.pdf']

with PaperQASession(llm='gpt-4o', summary_llm='gpt-4o-mini') as session:
    for pdf in pdf_paths:
        answer = session.summarize_pdf(pdf, "Summarize this paper")
        if answer:
            process_answer(answer)
# Environment automatically restored, temp files cleaned up
summarize_pdf(pdf_path, question)[source]

Process a single PDF and return paper-qa’s answer as a string.

This method handles the per-PDF processing workflow:

  1. Create per-PDF paper and index directories

  2. Copy the PDF into the per-PDF paper directory

  3. Build paper-qa Settings with those isolated directories

  4. Run paper-qa’s ask() function asynchronously

  5. Extract the answer string from paper-qa’s response object

  6. Clean up: remove per-PDF directories to keep runs isolated

Each run uses fresh directories, so paper-qa only sees one PDF per call without having to clear or reconcile a shared index.
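The per-PDF isolation step can be sketched as below. The directory prefixes are as documented above, but the function name is illustrative and the actual paper-qa calls are stubbed out as comments:

```python
import os
import shutil
import tempfile

def process_pdf_isolated(session_dir: str, pdf_path: str) -> str:
    """Sketch of per-PDF isolation: fresh paper/index directories per call,
    removed afterwards so no index state leaks between PDFs."""
    paper_dir = tempfile.mkdtemp(prefix="paper_", dir=session_dir)
    index_dir = tempfile.mkdtemp(prefix="index_", dir=session_dir)
    try:
        shutil.copy(pdf_path, paper_dir)  # original file is untouched
        # settings = paperqa.Settings(paper_directory=paper_dir,
        #                             index_directory=index_dir)
        # answer = paperqa.ask(question, settings=settings)
        answer = f"answer for {os.path.basename(pdf_path)}"  # placeholder
        return answer
    finally:
        shutil.rmtree(paper_dir, ignore_errors=True)
        shutil.rmtree(index_dir, ignore_errors=True)
```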

Parameters

pdf_path : str

Absolute path to the PDF file to process. The file is copied to the session’s temp directory, so the original is not modified.

question : str

The question to ask paper-qa about the PDF. This is typically a prompt asking for a JSON-formatted summary with ‘summary’ and ‘methods’ keys.

Returns

str or None

The answer string from paper-qa if successful; None if:

  • Session not initialized (called outside ‘with’ block)

  • PDF copy failed

  • paper-qa query failed

  • Answer extraction failed

Note

This method handles async/event loop edge cases:

  • If asyncio.run() fails due to an existing event loop, we spawn a background thread with its own event loop

  • This is necessary because Jupyter notebooks and some frameworks already have an event loop running
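The fallback described in the note can be sketched like this. The helper name is illustrative, not the actual internal function:

```python
import asyncio
import threading

def run_coro_safely(coro):
    """Run a coroutine whether or not an event loop is already running.
    If a loop is active in this thread (e.g. inside Jupyter), asyncio.run()
    would raise, so we run the coroutine in a background thread that owns
    its own event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(coro)  # no loop running: the simple path
    # A loop is already running in this thread.
    result = {}
    def worker():
        result["value"] = asyncio.run(coro)  # fresh loop in this thread
    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result["value"]
```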

Example

with PaperQASession() as session:
    answer = session.summarize_pdf(
        '/path/to/paper.pdf',
        'Summarize this paper. Return JSON with summary and methods.'
    )
    if answer:
        data = json.loads(answer)


Return type:

str | None

Parameters:
  • llm (Optional[str])

  • summary_llm (Optional[str])

paper_firehose.commands.pqa_summary.run(config_path, topic=None, *, rps=None, limit=None, arxiv=None, entry_ids=None, use_history=False, history_date=None, history_feed_like=None)[source]

Execute the paper-qa download + summarization workflow.

Return type:

None

Parameters:
  • config_path (str)

  • topic (str | None)

  • rps (float | None)

  • limit (int | None)

  • arxiv (List[str] | None)

  • entry_ids (List[str] | None)

  • use_history (bool)

  • history_date (str | None)

  • history_feed_like (str | None)

Workflow overview

  • Load configuration/database state and prepare download/archive folders.

  • Determine targets either from ranked topic entries (respecting the download rank threshold and optional limit) or from the explicit arxiv/entry_ids arguments, optionally pulling metadata from the history database when use_history is enabled.

  • Resolve arXiv IDs, reuse archived PDFs when possible, download missing PDFs under the configured rate limit, and archive successful downloads.

  • Run paper-qa on each PDF, normalize the JSON result, and write summaries back to both papers.db and matched_entries_history.db when an entry_id is available.
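The rate-limited download step above can be sketched as follows. This is an assumption-laden illustration: the arXiv PDF URL pattern, the User-Agent format, and the function signature are illustrative, not the actual implementation.

```python
import time
import urllib.request

def download_pdfs(arxiv_ids, dest_dir, rps=1.0, contact="you@example.com"):
    """Sketch of polite arXiv PDF downloading: at most `rps` requests per
    second, plus a descriptive User-Agent with a contact address, per the
    arXiv API Terms of Use."""
    min_interval = 1.0 / rps
    last = float("-inf")
    headers = {"User-Agent": f"paper_firehose ({contact})"}
    paths = []
    for aid in arxiv_ids:
        wait = min_interval - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)  # throttle to the configured rate
        last = time.monotonic()
        url = f"https://arxiv.org/pdf/{aid}.pdf"
        req = urllib.request.Request(url, headers=headers)
        out = f"{dest_dir}/{aid.replace('/', '_')}.pdf"
        with urllib.request.urlopen(req) as resp, open(out, "wb") as f:
            f.write(resp.read())
        paths.append(out)
    return paths
```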