paper_firehose.processors.abstract_fetcher

Multi-source abstract fetcher with fallback logic.

Orchestrates abstract fetching across several sources (Crossref, Semantic Scholar, OpenAlex, PubMed), choosing the fallback order from the entry's journal or domain.

Functions

crossref_pass(db, topic, threshold, *, ...)

Second pass: Crossref only (DOI first, then title) for entries above threshold.

fallback_pass(db, topic, threshold, *, ...)

Third pass: remaining above-threshold entries → Semantic Scholar / OpenAlex / PubMed.

fill_arxiv_summaries(db[, topics])

First pass: fill abstracts from the entry summary for arXiv/cond-mat entries; no score threshold is applied.

iter_targets(db, topic, threshold)

Yield ranked DB rows lacking abstracts for the given topic, highest score first.

try_abstract_sources(sources, doi, title, *, ...)

Try fetching abstract from a list of sources in order.

try_publisher_apis(doi, feed_name, link, *, ...)

Try publisher/aggregator APIs based on journal or domain.
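The summary lists the functions alphabetically, while the docstrings number the passes first (fill_arxiv_summaries), second (crossref_pass), and third (fallback_pass). The following is a minimal sketch of a driver that calls them in that order. The name `run_passes` is hypothetical, the three passes are injected as callables so the sketch runs without the package installed, and passing `topics` as a list of topic names is an assumption that the documentation does not confirm.

```python
from typing import Any, Callable, Dict, Optional


def run_passes(
    db: Any,
    topic: str,
    threshold: float,
    *,
    fill: Callable[..., int],
    crossref: Callable[..., int],
    fallback: Callable[..., int],
    mailto: str,
    session: Any,
    min_interval: float = 1.0,
    max_per_topic: Optional[int] = None,
) -> Dict[str, int]:
    """Run the three passes in their documented order and collect the counts."""
    shared = dict(
        mailto=mailto,
        session=session,
        min_interval=min_interval,
        max_per_topic=max_per_topic,
    )
    counts: Dict[str, int] = {}
    # First pass: no threshold; assumes `topics` accepts a list of names.
    counts["arxiv"] = fill(db, [topic])
    # Second pass: Crossref only, for entries above the threshold.
    counts["crossref"] = crossref(db, topic, threshold, **shared)
    # Third pass: whatever is still missing goes to the other sources.
    counts["fallback"] = fallback(db, topic, threshold, **shared)
    return counts
```

With the real package, the callables would presumably be `fill_arxiv_summaries`, `crossref_pass`, and `fallback_pass` imported from this module.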

paper_firehose.processors.abstract_fetcher.crossref_pass(db, topic, threshold, *, mailto, session, min_interval, max_per_topic, max_retries=3)

Second pass: Crossref only (DOI first, then title) for entries above threshold.

Parameters:
  • db (DatabaseManager) – DatabaseManager instance

  • topic (str) – Topic name to process

  • threshold (float) – Minimum rank score to include

  • mailto (str) – Contact email for Crossref API

  • session (Session) – requests.Session for API calls

  • min_interval (float) – Minimum seconds between API calls

  • max_per_topic (Optional[int]) – Optional maximum fetches per topic

  • max_retries (int) – Maximum retry attempts for failed requests

Return type:

int

Returns:

Number of abstracts fetched
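The `min_interval` and `max_retries` parameters suggest request throttling combined with bounded retries. The sketch below shows one way such behavior could be built; it is not taken from the library. `Throttle` and `call_with_retries` are hypothetical names, `max_retries` is interpreted here as the total number of attempts, and only `OSError` is caught (the base class of `requests` exceptions).

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum number of seconds between successive calls."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def call_with_retries(
    fetch: Callable[[], Optional[str]],
    throttle: Throttle,
    max_retries: int = 3,
) -> Optional[str]:
    """Call fetch() up to max_retries times, throttling before each attempt."""
    for _ in range(max_retries):
        throttle.wait()
        try:
            return fetch()
        except OSError:
            # Assumption: network errors are retried; anything else propagates.
            continue
    return None
```

Injecting `clock` and `sleep` keeps the helper testable without real delays.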

paper_firehose.processors.abstract_fetcher.fallback_pass(db, topic, threshold, *, mailto, session, min_interval, max_per_topic)

Third pass: remaining above-threshold entries → Semantic Scholar / OpenAlex / PubMed.

Parameters:
  • db (DatabaseManager) – DatabaseManager instance

  • topic (str) – Topic name to process

  • threshold (float) – Minimum rank score to include

  • mailto (str) – Contact email for API calls

  • session (Session) – requests.Session for API calls

  • min_interval (float) – Minimum seconds between API calls

  • max_per_topic (Optional[int]) – Optional maximum fetches per topic

Return type:

int

Returns:

Number of abstracts fetched
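Both threshold-based passes return a count and accept an optional `max_per_topic` cap. A rough sketch of a loop with that shape follows. `run_pass`, `fetch`, and `save` are hypothetical names, and the sketch assumes the cap counts successful fetches rather than attempts, which the documentation does not specify.

```python
from typing import Any, Callable, Dict, Iterable, Optional

Row = Dict[str, Any]


def run_pass(
    targets: Iterable[Row],
    fetch: Callable[[Row], Optional[str]],
    save: Callable[[Row, str], None],
    max_per_topic: Optional[int] = None,
) -> int:
    """Fetch abstracts for target rows and return how many were stored."""
    fetched = 0
    for row in targets:
        # Stop once the optional cap is reached (assumed to count successes).
        if max_per_topic is not None and fetched >= max_per_topic:
            break
        abstract = fetch(row)
        if abstract:
            save(row, abstract)
            fetched += 1
    return fetched
```

In this sketch a row whose fetch returns nothing is simply skipped, so a later pass can still pick it up.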

paper_firehose.processors.abstract_fetcher.fill_arxiv_summaries(db, topics=None)

First pass: fill abstracts from the entry summary for arXiv/cond-mat entries; no score threshold is applied.

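Parameters:
  • db (DatabaseManager) – DatabaseManager instance

  • topics – Optional topic names; defaults to None

Return type: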

int

Returns:

Number of rows updated
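As an illustration of what copying a summary into an empty abstract might look like, here is a small in-memory sketch. The row keys (`feed_name`, `summary`, `abstract`) and the substring check used to recognize arXiv/cond-mat feeds are assumptions made for this example; the real function works against the DatabaseManager and may identify such entries differently.

```python
from typing import Any, Dict, List


def fill_from_summary(rows: List[Dict[str, Any]]) -> int:
    """Copy summary into abstract for arXiv-like rows; return rows updated."""
    updated = 0
    for row in rows:
        feed = (row.get("feed_name") or "").lower()
        # Assumption: arXiv feeds can be recognized by their feed name.
        is_arxiv = "arxiv" in feed or "cond-mat" in feed
        if is_arxiv and not row.get("abstract") and row.get("summary"):
            row["abstract"] = row["summary"].strip()
            updated += 1
    return updated
```

Rows that already have an abstract are left untouched, so running the pass again makes no further changes.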

paper_firehose.processors.abstract_fetcher.iter_targets(db, topic, threshold)

Yield ranked DB rows lacking abstracts for the given topic, highest score first.

Parameters:
  • db (DatabaseManager) – DatabaseManager instance

  • topic (str) – Topic name to filter by

  • threshold (float) – Minimum rank score to include

Yields:

Dictionary representing each database row

Return type:

Iterable[Dict[str, Any]]
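The selection described here (rows for one topic, at or above the threshold, with no abstract, highest score first) can be expressed as a single query. The sketch below uses the standard-library sqlite3 module with an assumed table `entries` and assumed columns `topic`, `rank_score`, and `abstract`; the actual schema behind DatabaseManager is not documented here.

```python
import sqlite3
from typing import Any, Dict, Iterator


def iter_targets_sketch(
    conn: sqlite3.Connection, topic: str, threshold: float
) -> Iterator[Dict[str, Any]]:
    """Yield rows lacking an abstract for a topic, highest score first."""
    cur = conn.cursor()
    cur.row_factory = sqlite3.Row  # lets each row be converted with dict()
    cur.execute(
        "SELECT * FROM entries "
        "WHERE topic = ? AND rank_score >= ? "
        "AND (abstract IS NULL OR abstract = '') "
        "ORDER BY rank_score DESC",
        (topic, threshold),
    )
    for row in cur:
        yield dict(row)
```

Because the function is a generator, the query only runs when iteration starts.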

paper_firehose.processors.abstract_fetcher.try_abstract_sources(sources, doi, title, *, mailto, session)

Try fetching abstract from a list of sources in order.

Parameters:
  • sources (list[AbstractSource]) – List of AbstractSource instances to try in order

  • doi (Optional[str]) – Digital Object Identifier (optional)

  • title (Optional[str]) – Paper title (optional)

  • mailto (str) – Contact email for API calls

  • session (Optional[Session]) – requests.Session for API calls

Return type:

Optional[str]

Returns:

Abstract text or None if not found from any source
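Trying sources in order and returning the first hit is a simple first-match loop. The sketch below illustrates the pattern under stated assumptions: the `Source` protocol and its `fetch` method are hypothetical (the real AbstractSource interface is not shown in this reference), and skipping a source that raises is a design choice of this example, not confirmed behavior.

```python
from typing import Any, Optional, Protocol, Sequence


class Source(Protocol):
    """Hypothetical stand-in for the AbstractSource interface."""

    def fetch(
        self, doi: Optional[str], title: Optional[str], *, mailto: str, session: Any
    ) -> Optional[str]: ...


def first_abstract(
    sources: Sequence[Source],
    doi: Optional[str],
    title: Optional[str],
    *,
    mailto: str,
    session: Any = None,
) -> Optional[str]:
    """Return the first non-empty abstract, or None if every source fails."""
    if not doi and not title:
        return None  # nothing to look up
    for source in sources:
        try:
            text = source.fetch(doi, title, mailto=mailto, session=session)
        except Exception:
            # Assumption: a failing source should not block the next one.
            continue
        if text and text.strip():
            return text.strip()
    return None
```

The order of `sources` is the fallback order, so callers control priority by how they build the list.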

paper_firehose.processors.abstract_fetcher.try_publisher_apis(doi, feed_name, link, *, mailto, session)

Try publisher/aggregator APIs based on journal or domain.

Order (by typical coverage): Semantic Scholar, then OpenAlex; PubMed is tried for PNAS or other biomedical entries.

Parameters:
  • doi (Optional[str]) – Digital Object Identifier (optional)

  • feed_name (str) – Name of the RSS feed source

  • link (str) – URL to the paper

  • mailto (str) – Contact email for API calls

  • session (Optional[Session]) – requests.Session for API calls

Return type:

Optional[str]

Returns:

Abstract text or None if not found
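The journal/domain-based ordering could be derived from the feed name and the link's host. The sketch below is only an illustration: `choose_sources` and the source labels are hypothetical, PubMed is assumed to be appended after the two general sources, and since the criteria for "biomedical" are not documented, the caller supplies a set of hosts.

```python
from typing import FrozenSet, List
from urllib.parse import urlparse


def choose_sources(
    feed_name: str,
    link: str,
    biomedical_hosts: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return source labels in the order they would be tried."""
    order = ["semantic_scholar", "openalex"]
    host = (urlparse(link).hostname or "").lower()
    is_pnas = "pnas" in feed_name.lower() or host.endswith("pnas.org")
    # Assumption: PubMed is added last for PNAS or caller-listed hosts.
    if is_pnas or host in biomedical_hosts:
        order.append("pubmed")
    return order
```

The resulting labels could then be mapped to source objects and handed to try_abstract_sources.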