paper_firehose.commands.pqa_summary¶
Paper-QA Summarizer¶
This command selects entries from papers.db for a given topic with
rank_score >= the configured threshold and downloads arXiv PDFs for them,
adhering to arXiv API Terms of Use (polite rate limiting and descriptive
User-Agent with contact email).
Workflow¶
Load configuration and identify ranked entries above threshold
Download arXiv PDFs (respecting rate limits, reusing archived copies)
Run paper-qa over each PDF to produce grounded JSON summaries
Write summaries into papers.db and matched_entries_history.db
Architecture: PaperQASession¶
The paper-qa library uses a persistent index stored in ~/.pqa/ (or wherever
PQA_HOME points). This creates a critical issue when processing multiple PDFs:
The Problem:
Paper-qa reads and caches PQA_HOME at import time. Python’s import system
caches modules, so subsequent import paperqa statements return the cached
module without re-reading environment variables. This means:
First PDF: Set PQA_HOME=/tmp/A, import paperqa → paperqa caches /tmp/A
Process PDF successfully, clean up /tmp/A
Second PDF: Set PQA_HOME=/tmp/B, import paperqa → NO-OP, module cached!
Paperqa still uses /tmp/A (now deleted) → “no papers found” error
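The import-caching behavior described above can be reproduced with a small, hypothetical stand-in module that reads an environment variable at import time (mirroring how paper-qa reads PQA_HOME; `demo_pkg` and `DEMO_HOME` are invented names for illustration):

```python
import importlib
import os
import sys
import tempfile

# Write a tiny module that, like paper-qa, captures an env var at import time.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "demo_pkg.py"), "w") as f:
    f.write("import os\nHOME = os.environ.get('DEMO_HOME')\n")
sys.path.insert(0, tmp)

os.environ["DEMO_HOME"] = "/tmp/A"
import demo_pkg
print(demo_pkg.HOME)        # /tmp/A -- captured at import time

os.environ["DEMO_HOME"] = "/tmp/B"
import demo_pkg             # no-op: sys.modules already holds demo_pkg
print(demo_pkg.HOME)        # still /tmp/A -- the cached module is returned

importlib.reload(demo_pkg)  # forcing a reload does re-execute the module body
print(demo_pkg.HOME)        # /tmp/B
```

This is exactly why setting PQA_HOME between PDFs has no effect: the second `import` never re-runs the module body.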
The Solution: PaperQASession
We use a context manager that:
Creates ONE temp directory for the entire summarization session
Sets PQA_HOME and changes working directory BEFORE importing paperqa
Imports paperqa ONCE (it correctly caches the session directory)
Processes each PDF in isolated paper/index directories
Cleans up the session directory on exit
This ensures paperqa always sees a valid, consistent environment throughout the entire run, regardless of how many PDFs are processed.
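A minimal sketch of that context-manager shape (illustrative only; the real PaperQASession additionally creates per-PDF directories and builds paper-qa settings):

```python
import os
import shutil
import tempfile

class SessionDirSketch:
    """Hold ONE temp dir for a whole session; restore env and CWD on exit."""

    def __enter__(self):
        self.temp_dir = tempfile.mkdtemp(prefix="paperqa_session_")
        self.original_cwd = os.getcwd()
        self.original_pqa_home = os.environ.get("PQA_HOME")
        os.environ["PQA_HOME"] = self.temp_dir  # set BEFORE the first import
        os.chdir(self.temp_dir)
        # import paperqa  # would happen here, once, after PQA_HOME is set
        return self

    def __exit__(self, exc_type, exc, tb):
        os.chdir(self.original_cwd)             # restore even on exceptions
        if self.original_pqa_home is None:
            os.environ.pop("PQA_HOME", None)
        else:
            os.environ["PQA_HOME"] = self.original_pqa_home
        shutil.rmtree(self.temp_dir, ignore_errors=True)
        return False                            # do not swallow exceptions
```

Because `__exit__` runs whether or not an exception was raised, PQA_HOME and the working directory are always restored.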
Why it works locally but fails on VPS:
Local: Fresh Python process per test → import cache empty → works
VPS (pipx): Long-running process or module caching → import cache populated from first PDF → subsequent PDFs fail
Usage Example¶
with PaperQASession(llm='gpt-4o', summary_llm='gpt-4o-mini') as session:
    for pdf_path in pdf_paths:
        answer = session.summarize_pdf(pdf_path, "Summarize this paper")
        if answer:
            print(answer)
Functions
run(config_path, topic=None, ...)
Execute the paper-qa download + summarization workflow.
Classes
PaperQASession(llm=None, summary_llm=None)
Context manager for processing multiple PDFs with paper-qa.
- class paper_firehose.commands.pqa_summary.PaperQASession(llm=None, summary_llm=None)[source]¶
Bases: object
Context manager for processing multiple PDFs with paper-qa.
This class solves a critical issue with paper-qa’s environment handling: paper-qa reads PQA_HOME at import time and caches it internally. Python’s import system caches modules, so setting PQA_HOME before subsequent imports has no effect - the module is already loaded with the old value.
Solution Architecture¶
Instead of creating a new temp directory for each PDF (which fails after the first PDF because paperqa is already imported with the old path), we create ONE session directory and process all PDFs within it:
Session Start
├── Create /tmp/paperqa_session_xxx/
├── Set PQA_HOME=/tmp/paperqa_session_xxx
├── Change CWD to /tmp/paperqa_session_xxx
└── Import paperqa (caches /tmp/paperqa_session_xxx) ✓

For each PDF:
├── Create /tmp/paperqa_session_xxx/paper_*/ and index_*/ dirs
├── Copy PDF into paper_*/ and run ask() with those dirs
└── Remove per-PDF dirs to keep runs isolated

Session End
├── Restore original CWD
├── Restore original PQA_HOME
└── Delete /tmp/paperqa_session_xxx/
Key Design Decisions¶
Single import: paperqa is imported exactly once per session, so it correctly caches the session’s temp directory.
Per-PDF isolation: Each PDF gets its own paper/index directories, so there is no cross-contamination or stale index state between runs.
PDF removal: We remove each PDF after processing to ensure paper-qa only sees one PDF at a time during indexing.
Environment restoration: We carefully restore PQA_HOME and CWD on exit, even if an exception occurs.
Attributes¶
- llm : str or None
The LLM model to use for paper-qa (e.g., ‘gpt-4o’, ‘gpt-5.2’).
- summary_llm : str or None
The summary LLM model (e.g., ‘gpt-4o-mini’).
- temp_dir : str or None
Path to the session’s temporary directory (set on __enter__).
- original_cwd : str or None
The working directory before session start (for restoration).
- original_pqa_home : str or None
The PQA_HOME value before session start (for restoration).
Example¶
pdf_paths = ['/path/to/paper1.pdf', '/path/to/paper2.pdf']
with PaperQASession(llm='gpt-4o', summary_llm='gpt-4o-mini') as session:
    for pdf in pdf_paths:
        answer = session.summarize_pdf(pdf, "Summarize this paper")
        if answer:
            process_answer(answer)
# Environment automatically restored, temp files cleaned up
- summarize_pdf(pdf_path, question)[source]¶
Process a single PDF and return paper-qa’s answer as a string.
This method handles the per-PDF processing workflow:
1. Create per-PDF paper and index directories
2. Copy the PDF into the per-PDF paper directory
3. Build paper-qa Settings with those isolated directories
4. Run paper-qa’s ask() function asynchronously
5. Extract the answer string from paper-qa’s response object
6. Clean up: remove per-PDF directories to keep runs isolated
Each run uses fresh directories, so paper-qa only sees one PDF per call without having to clear or reconcile a shared index.
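The per-PDF isolation can be sketched as follows (a hypothetical helper, not the actual implementation; `run_query` stands in for the step that builds paper-qa Settings and calls ask()):

```python
import os
import shutil
import tempfile

def with_isolated_pdf(session_dir, pdf_path, run_query):
    """Copy one PDF into fresh paper/index dirs, run a query, then clean up."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    paper_dir = tempfile.mkdtemp(prefix=f"paper_{stem}_", dir=session_dir)
    index_dir = tempfile.mkdtemp(prefix=f"index_{stem}_", dir=session_dir)
    try:
        shutil.copy2(pdf_path, paper_dir)       # paper-qa sees exactly one PDF
        return run_query(paper_dir, index_dir)  # e.g. build Settings + ask()
    finally:
        shutil.rmtree(paper_dir, ignore_errors=True)  # no stale index state
        shutil.rmtree(index_dir, ignore_errors=True)
```

Since both directories are created fresh and removed per call, no index state survives between PDFs.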
Parameters¶
- pdf_path : str
Absolute path to the PDF file to process. The file is copied to the session’s temp directory, so the original is not modified.
- question : str
The question to ask paper-qa about the PDF. This is typically a prompt asking for a JSON-formatted summary with ‘summary’ and ‘methods’ keys.
Returns¶
- str or None
The answer string from paper-qa if successful; None if:
- Session not initialized (called outside ‘with’ block)
- PDF copy failed
- paper-qa query failed
- Answer extraction failed
Note¶
This method handles async/event-loop edge cases:
- If asyncio.run() fails because an event loop is already running, we spawn a background thread with its own event loop.
- This is necessary because Jupyter notebooks and some frameworks already have an event loop running.
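That fallback can be sketched like this (illustrative; the real code wraps paper-qa’s ask() coroutine, and `run_coro_blocking` is an invented name):

```python
import asyncio
import threading

def run_coro_blocking(coro):
    """Run a coroutine to completion, even if an event loop is already running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop in this thread: the simple path works.
        return asyncio.run(coro)
    # A loop is already running (e.g. Jupyter): asyncio.run() would raise,
    # so run the coroutine in a background thread with its own fresh loop.
    result = {}
    def worker():
        try:
            result["value"] = asyncio.run(coro)
        except Exception as exc:
            result["error"] = exc
    t = threading.Thread(target=worker)
    t.start()
    t.join()
    if "error" in result:
        raise result["error"]
    return result["value"]
```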
Example¶
with PaperQASession() as session:
    answer = session.summarize_pdf(
        '/path/to/paper.pdf',
        'Summarize this paper. Return JSON with summary and methods.'
    )
    if answer:
        data = json.loads(answer)
- paper_firehose.commands.pqa_summary.run(config_path, topic=None, *, rps=None, limit=None, arxiv=None, entry_ids=None, use_history=False, history_date=None, history_feed_like=None)[source]¶
Execute the paper-qa download + summarization workflow.
- Return type:
- Parameters:
Workflow overview¶
Load configuration/database state and prepare download/archive folders.
Determine targets either from ranked topic entries (respecting the download rank threshold and optional limit) or from the explicit arxiv / entry_ids arguments, optionally pulling metadata from the history database when use_history is enabled.
Resolve arXiv IDs, reuse archived PDFs when possible, download missing PDFs under the configured rate limit, and archive successful downloads.
Run paper-qa on each PDF, normalize the JSON result, and write summaries back to both papers.db and matched_entries_history.db when an entry_id is available.
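The polite-download step can be sketched as follows, assuming a simple minimum-interval rate limiter and a descriptive User-Agent with a contact address (names, the `rps` value, and the email are placeholders, not the actual implementation):

```python
import time
import urllib.request

class RateLimiter:
    """Enforce a minimum interval between calls (rps=0.33 ~ one per 3 s)."""
    def __init__(self, rps):
        self.min_interval = 1.0 / rps
        self._last = None

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)  # polite rate limiting between requests
        self._last = time.monotonic()

def fetch_pdf(url, dest, limiter, contact="you@example.org"):
    """Download one PDF with a descriptive User-Agent, per arXiv's Terms of Use."""
    limiter.wait()
    req = urllib.request.Request(
        url, headers={"User-Agent": f"paper-firehose/0.1 (mailto:{contact})"}
    )
    with urllib.request.urlopen(req) as resp, open(dest, "wb") as out:
        out.write(resp.read())
```

Keeping the limiter as shared state across all downloads ensures the interval is respected even when some PDFs are served from the archive and skipped.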