paper_firehose.commands.abstracts

Fetch abstracts and populate both papers.db (entries.abstract) and matched_entries_history.db (matched_entries.abstract).

Rules

  • The first pass fills arXiv/cond-mat abstracts from the entry summary, with no rank threshold applied.

  • The second pass handles rows with rank_score >= threshold: query Crossref first (by DOI, then by title search), then fall back to the aggregators (Semantic Scholar, OpenAlex, PubMed).

  • Only process topics where the topic YAML has abstract_fetch.enabled: true.

  • Use per-topic abstract_fetch.rank_threshold if set; otherwise fall back to global defaults.rank_threshold in config.yaml.

  • Respect API rate limits; include a descriptive User-Agent with contact email and obey Retry-After on 429/503 responses. Default to ~1 request/second.
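
The topic-selection, threshold, and rate-limit rules above can be sketched as a few small helpers. This is an illustrative sketch, not the module's actual code: the function names (is_fetch_enabled, resolve_rank_threshold, retry_delay) are hypothetical, the configs are assumed to be already loaded into plain dicts, and only the delta-seconds form of Retry-After is handled (the HTTP-date form falls back to the base delay).

```python
from typing import Any, Mapping, Optional


def is_fetch_enabled(topic_cfg: Mapping[str, Any]) -> bool:
    """True only when the topic YAML sets abstract_fetch.enabled: true."""
    fetch_cfg = topic_cfg.get("abstract_fetch") or {}
    return bool(fetch_cfg.get("enabled", False))


def resolve_rank_threshold(
    topic_cfg: Mapping[str, Any], global_cfg: Mapping[str, Any]
) -> Optional[float]:
    """Prefer the per-topic threshold; otherwise use defaults.rank_threshold.

    Returns None when neither is set (how the real command treats that
    case is not specified here).
    """
    fetch_cfg = topic_cfg.get("abstract_fetch") or {}
    threshold = fetch_cfg.get("rank_threshold")
    if threshold is None:
        threshold = (global_cfg.get("defaults") or {}).get("rank_threshold")
    return threshold


def retry_delay(status: int, headers: Mapping[str, str], rps: float = 1.0) -> float:
    """Seconds to wait before the next request.

    The base delay is 1/rps. On 429/503 a numeric Retry-After header is
    honoured if it is longer than the base delay. Assumes rps > 0 and a
    header mapping keyed exactly as "Retry-After".
    """
    base = 1.0 / rps
    if status in (429, 503):
        raw = headers.get("Retry-After", "")
        try:
            return max(base, float(raw))
        except ValueError:
            # Missing header or HTTP-date form: fall back to the base delay.
            return base
    return base
```

With these helpers, a topic is skipped when is_fetch_enabled returns False, and a caller sleeps for retry_delay(...) seconds between requests.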

Functions

run(config_path[, topic, mailto, ...])

Fetch and write abstracts into papers.db for ranked entries.

paper_firehose.commands.abstracts.run(config_path, topic=None, *, mailto=None, max_per_topic=None, rps=1.0, output_json=False)

Fetch and write abstracts into papers.db for ranked entries.

Parameters:
  • config_path (str) – Path to the main configuration file

  • topic (Optional[str]) – Optional single topic; otherwise process all topics

  • mailto (Optional[str]) – Contact email for Crossref User-Agent

  • max_per_topic (Optional[int]) – Optional cap on number of fetches per topic

  • rps (float) – Maximum request rate in requests per second (default 1.0)

  • output_json (bool) – When True, suppress log noise and return a result dict.

Return type:

Optional[Dict[str, Any]]

Returns:

Result dict when output_json is True, otherwise None.
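
A minimal usage sketch based on the signature documented above. The wrapper name fetch_for_topic and the cap of 50 fetches are hypothetical, chosen only for illustration. The run callable is passed in as an argument so the wrapper can be exercised without the package installed. The keys of the returned dict are not documented here, so the sketch does not inspect them.

```python
from typing import Any, Callable, Dict, Optional

RunFn = Callable[..., Optional[Dict[str, Any]]]


def fetch_for_topic(
    run: RunFn, config_path: str, topic: str, mailto: str
) -> Dict[str, Any]:
    """Call run() for a single topic and always return a dict."""
    result = run(
        config_path,
        topic,
        mailto=mailto,        # contact email for the Crossref User-Agent
        max_per_topic=50,     # hypothetical cap on fetches for this topic
        rps=1.0,              # documented default request rate
        output_json=True,     # ask for a result dict instead of None
    )
    # run() is documented to return None unless output_json is True;
    # normalise to an empty dict so callers need no None check.
    return result if result is not None else {}
```

In real use, the first argument would be paper_firehose.commands.abstracts.run, with a config path, topic name, and contact email appropriate to the installation.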