paper_firehose.commands.abstracts

Fetch abstracts and populate both papers.db (entries.abstract) and matched_entries_history.db (matched_entries.abstract).

Rules

The first pass fills arXiv/cond-mat abstracts from summary; no rank threshold applies to this pass.
The second pass covers rows with rank_score >= threshold: it queries Crossref (by DOI, then by title search), then falls back to the aggregators (Semantic Scholar, OpenAlex, PubMed).
Only process topics where the topic YAML has abstract_fetch.enabled: true.
Use per-topic abstract_fetch.rank_threshold if set; otherwise fall back to global defaults.rank_threshold in config.yaml.
Respect API rate limits; include a descriptive User-Agent with contact email and obey Retry-After on 429/503 responses. Default to ~1 request/second.
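
The topic-selection and throttling rules above can be sketched as three small helpers. This is an illustrative sketch, not the module's actual code: the function names (fetch_enabled, resolve_rank_threshold, next_delay) are hypothetical, and it assumes the topic YAML and config.yaml have already been parsed into plain dicts. It also handles only the delay-in-seconds form of Retry-After; the HTTP-date form falls back to the base delay.

```python
from typing import Any, Dict, Optional


def fetch_enabled(topic_cfg: Dict[str, Any]) -> bool:
    """True only when the topic YAML sets abstract_fetch.enabled: true."""
    fetch_cfg = topic_cfg.get("abstract_fetch") or {}
    return fetch_cfg.get("enabled", False) is True


def resolve_rank_threshold(
    topic_cfg: Dict[str, Any], global_cfg: Dict[str, Any]
) -> Optional[float]:
    """Prefer the per-topic threshold; otherwise use defaults.rank_threshold.

    Returns None if neither is set (how the real module treats that case
    is not documented here).
    """
    fetch_cfg = topic_cfg.get("abstract_fetch") or {}
    threshold = fetch_cfg.get("rank_threshold")
    if threshold is None:  # explicit None check so a threshold of 0 is kept
        threshold = (global_cfg.get("defaults") or {}).get("rank_threshold")
    return threshold


def next_delay(status: int, retry_after: Optional[str], rps: float = 1.0) -> float:
    """Seconds to wait before the next request.

    The base delay is 1/rps. On 429/503 with a numeric Retry-After header,
    wait at least as long as the server asks.
    """
    base = 1.0 / rps if rps > 0 else 1.0
    if status in (429, 503) and retry_after is not None:
        try:
            return max(base, float(retry_after))
        except ValueError:
            # HTTP-date form of Retry-After is not parsed in this sketch.
            return base
    return base
```

A caller would skip any topic for which fetch_enabled returns False, compare each row's rank_score against the resolved threshold, and sleep for next_delay(...) seconds between requests.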

Functions

run(config_path[, topic, mailto, ...])

Fetch and write abstracts into papers.db for ranked entries.

paper_firehose.commands.abstracts.run(config_path, topic=None, *, mailto=None, max_per_topic=None, rps=1.0, output_json=False)

Fetch and write abstracts into papers.db for ranked entries.

Parameters:

config_path (str) – Path to the main configuration file
topic (Optional[str]) – Optional single topic; otherwise process all topics
mailto (Optional[str]) – Contact email for Crossref User-Agent
max_per_topic (Optional[int]) – Optional cap on number of fetches per topic
rps (float) – Request-rate throttle in requests per second (default 1.0)
output_json (bool) – When True, suppress log noise and return a result dict.

Return type:

Optional[Dict[str, Any]]

Returns:

Result dict when output_json is True, otherwise None.
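
As a usage sketch based on the signature above, a thin wrapper might call run for a single topic and collect the result dict. The wrapper name fetch_topic_abstracts, the cap of 50, and the example arguments in the trailing comment are assumptions for illustration; the keys of the returned dict are not documented here, so the sketch does not inspect them.

```python
from typing import Any, Dict, Optional


def fetch_topic_abstracts(
    config_path: str, topic: str, mailto: str
) -> Optional[Dict[str, Any]]:
    """Run the abstracts command for one topic and return its result dict."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from paper_firehose.commands import abstracts

    return abstracts.run(
        config_path,
        topic,                # positional: process only this topic
        mailto=mailto,        # keyword-only: contact email for the User-Agent
        max_per_topic=50,     # hypothetical cap on fetches for this topic
        rps=1.0,              # documented default rate
        output_json=True,     # return a result dict instead of None
    )


# Example call (path, topic name, and email are placeholders):
# result = fetch_topic_abstracts("config/config.yaml", "primary", "you@example.org")
```

Everything after topic is keyword-only in the signature, so mailto, max_per_topic, rps, and output_json must be passed by name.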