paper_firehose.processors.abstract_fetcher
Multi-source abstract fetcher with fallback logic.
Orchestrates abstract fetching across multiple sources (Crossref, Semantic Scholar, OpenAlex, PubMed), choosing the fallback order based on the journal or domain.
Functions
- crossref_pass: Second pass: Crossref only (DOI first, then title) for entries above threshold.
- fallback_pass: Third pass: remaining above-threshold entries → Semantic Scholar / OpenAlex / PubMed.
- fill_arxiv_summaries: First pass: fill abstracts from summary for arXiv/cond-mat entries, no threshold.
- iter_targets: Yield ranked DB rows lacking abstracts for the given topic, highest score first.
- try_abstract_sources: Try fetching abstract from a list of sources in order.
- try_publisher_apis: Try publisher/aggregator APIs based on journal or domain.
- paper_firehose.processors.abstract_fetcher.crossref_pass(db, topic, threshold, *, mailto, session, min_interval, max_per_topic, max_retries=3)
Second pass: Crossref only (DOI first, then title) for entries above threshold.
- Parameters:
db (DatabaseManager) – DatabaseManager instance
topic (str) – Topic name to process
threshold (float) – Minimum rank score to include
mailto (str) – Contact email for Crossref API
session (Session) – requests.Session for API calls
min_interval (float) – Minimum seconds between API calls
max_per_topic (Optional[int]) – Optional maximum fetches per topic
max_retries (int) – Maximum retry attempts for failed requests
- Return type:
- Returns:
Number of abstracts fetched
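The "DOI first, then title" order with retries can be sketched as below. This is a minimal illustration, not the module's implementation: `crossref_abstract`, `strip_jats`, and the injected `get_json(url, params)` hook are hypothetical names, and treating a `None` payload as a retryable failure is an assumption. The endpoint, the `mailto` and `query.bibliographic` parameters, and the `message` / `items` / `abstract` fields belong to the public Crossref REST API.

```python
import re
import time
from typing import Any, Callable, Dict, Optional
from urllib.parse import quote

CROSSREF_WORKS = "https://api.crossref.org/works"

# Hypothetical hook: returns parsed JSON, or None when the request failed.
JsonGetter = Callable[[str, Dict[str, Any]], Optional[Dict[str, Any]]]


def strip_jats(text: str) -> str:
    """Drop the JATS/XML tags Crossref abstracts often carry and tidy whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


def crossref_abstract(
    doi: Optional[str],
    title: Optional[str],
    *,
    mailto: str,
    get_json: JsonGetter,
    max_retries: int = 3,
    min_interval: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Look the work up by DOI first, then fall back to a title search."""
    lookups = []
    if doi:
        lookups.append((f"{CROSSREF_WORKS}/{quote(doi)}", {"mailto": mailto}))
    if title:
        lookups.append(
            (CROSSREF_WORKS, {"query.bibliographic": title, "rows": 1, "mailto": mailto})
        )

    for url, params in lookups:
        payload = None
        for _ in range(max_retries):
            payload = get_json(url, params)
            sleep(min_interval)  # assumed pacing: wait after every call
            if payload is not None:
                break
        if payload is None:
            continue  # every attempt failed, move on to the next lookup
        message = payload.get("message", {})
        items = message.get("items")  # only present for search responses
        record = items[0] if items else message
        abstract = record.get("abstract")
        if abstract:
            return strip_jats(abstract)
    return None
```

With a `requests.Session`, `get_json` could wrap `session.get(url, params=params, timeout=10)` and return `response.json()` on success. A title search returns the best match rather than an exact one, so real code would likely compare the returned title with the requested one before accepting the abstract.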
- paper_firehose.processors.abstract_fetcher.fallback_pass(db, topic, threshold, *, mailto, session, min_interval, max_per_topic)
Third pass: remaining above-threshold entries → Semantic Scholar / OpenAlex / PubMed.
- Parameters:
db (DatabaseManager) – DatabaseManager instance
topic (str) – Topic name to process
threshold (float) – Minimum rank score to include
mailto (str) – Contact email for API calls
session (Session) – requests.Session for API calls
min_interval (float) – Minimum seconds between API calls
max_per_topic (Optional[int]) – Optional maximum fetches per topic
- Return type:
- Returns:
Number of abstracts fetched
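How `min_interval` and `max_per_topic` might shape a pass can be sketched as a small loop. The names `run_pass`, `fetch`, and `save` are hypothetical, the rows are assumed to be filtered by threshold already, and counting only successful fetches toward `max_per_topic` is an assumption about the real behavior.

```python
import time
from typing import Any, Callable, Iterable, Optional


def run_pass(
    rows: Iterable[Any],
    fetch: Callable[[Any], Optional[str]],
    save: Callable[[Any, str], None],
    *,
    min_interval: float,
    max_per_topic: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Fetch an abstract per row, pacing calls and stopping at the cap."""
    fetched = 0
    last_call = None
    for row in rows:
        if max_per_topic is not None and fetched >= max_per_topic:
            break
        if last_call is not None:
            # Sleep only for the part of min_interval that has not yet elapsed.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        abstract = fetch(row)
        if abstract:
            save(row, abstract)
            fetched += 1
    return fetched  # number of abstracts fetched, as in the documented return value
```

Injecting `sleep` and `clock` keeps the pacing logic testable without real delays.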
- paper_firehose.processors.abstract_fetcher.fill_arxiv_summaries(db, topics=None)
First pass: fill abstracts from summary for arXiv/cond-mat entries, no threshold.
- Parameters:
db (DatabaseManager) – DatabaseManager instance
topics (Optional[list[str]]) – Optional list of topics to process (None = all topics)
- Return type:
- Returns:
Number of rows updated
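The first pass needs no network access, since the feed summary already holds the abstract for arXiv entries. A rough sketch with plain `sqlite3` follows. The `entries` table, its column names, and matching arXiv feeds by `feed_name` are all assumptions made for illustration; the actual schema behind `DatabaseManager` is not shown in this reference.

```python
import sqlite3
from typing import List, Optional


def fill_from_summary(conn: sqlite3.Connection, topics: Optional[List[str]] = None) -> int:
    """Copy summary into abstract for arXiv-like rows that lack an abstract."""
    sql = (
        "UPDATE entries SET abstract = summary "
        "WHERE (abstract IS NULL OR abstract = '') "
        "AND summary IS NOT NULL AND summary <> '' "
        # Assumed heuristic for spotting arXiv / cond-mat feeds.
        "AND (lower(feed_name) LIKE '%arxiv%' OR lower(feed_name) LIKE '%cond-mat%')"
    )
    params: List[str] = []
    if topics is not None:
        if not topics:
            return 0  # an explicit empty list means nothing to process
        placeholders = ", ".join("?" for _ in topics)
        sql += f" AND topic IN ({placeholders})"
        params.extend(topics)
    cursor = conn.execute(sql, params)
    conn.commit()
    return cursor.rowcount  # number of rows updated
```

No rank threshold appears in the query, matching the "no threshold" note above.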
- paper_firehose.processors.abstract_fetcher.iter_targets(db, topic, threshold)
Yield ranked DB rows lacking abstracts for the given topic, highest score first.
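As an illustration of "rows lacking abstracts, highest score first", a generator over a hypothetical `entries` table might look like this. The table layout, the column names, and the inclusive `>=` comparison against the threshold are assumptions; `iter_missing_abstracts` is not part of the package.

```python
import sqlite3
from typing import Iterator


def iter_missing_abstracts(
    conn: sqlite3.Connection, topic: str, threshold: float
) -> Iterator[sqlite3.Row]:
    """Yield rows for one topic that still need an abstract, best-ranked first."""
    cursor = conn.cursor()
    cursor.row_factory = sqlite3.Row  # allows access by column name
    cursor.execute(
        "SELECT id, doi, title, rank_score FROM entries "
        "WHERE topic = ? AND rank_score >= ? "
        "AND (abstract IS NULL OR abstract = '') "
        "ORDER BY rank_score DESC",
        (topic, threshold),
    )
    yield from cursor
```

Ordering by score means that a per-topic cap in a later pass spends its budget on the most relevant entries first.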
- paper_firehose.processors.abstract_fetcher.try_abstract_sources(sources, doi, title, *, mailto, session)
Try fetching abstract from a list of sources in order.
- Parameters:
- Return type:
- Returns:
Abstract text or None if not found from any source
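The ordered fallback can be sketched in a few lines. Here `sources` is assumed to be a list of callables sharing one signature; the real function may accept source names instead, since the parameter types are not listed in this reference. `first_abstract` is a hypothetical name.

```python
from typing import Any, Callable, Optional, Sequence

# Assumed shape of a source: (doi, title, *, mailto, session) -> abstract or None.
Source = Callable[..., Optional[str]]


def first_abstract(
    sources: Sequence[Source],
    doi: Optional[str],
    title: Optional[str],
    *,
    mailto: str,
    session: Any,
) -> Optional[str]:
    """Return the first non-empty abstract, trying sources in the given order."""
    for source in sources:
        try:
            abstract = source(doi, title, mailto=mailto, session=session)
        except Exception:
            # A failing source should not block the ones after it.
            continue
        if abstract and abstract.strip():
            return abstract.strip()
    return None
```

Swallowing every exception keeps the sketch short; production code would more likely catch network and parsing errors specifically and log them.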
- paper_firehose.processors.abstract_fetcher.try_publisher_apis(doi, feed_name, link, *, mailto, session)
Try publisher/aggregator APIs based on journal or domain.
Order (by common coverage): Semantic Scholar, OpenAlex; for PNAS (or biomedical), try PubMed.
- Parameters:
- Return type:
- Returns:
Abstract text or None if not found
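One way to picture the journal- or domain-based ordering is a helper that builds the source list from `feed_name` and `link`. Everything here is illustrative: `choose_sources`, the hint strings, and the source labels are hypothetical, and appending PubMed after the two general sources (rather than trying it first) is an assumption about the described order.

```python
from typing import List, Optional
from urllib.parse import urlparse

# Hypothetical markers for PNAS / biomedical feeds.
BIOMEDICAL_HINTS = ("pnas", "pubmed", "ncbi.nlm.nih.gov")


def choose_sources(feed_name: Optional[str], link: Optional[str]) -> List[str]:
    """Pick the order of sources to try for one entry."""
    order = ["semantic_scholar", "openalex"]  # general sources come first
    host = urlparse(link or "").netloc.lower()
    haystack = f"{(feed_name or '').lower()} {host}"
    if any(hint in haystack for hint in BIOMEDICAL_HINTS):
        order.append("pubmed")
    return order
```

The returned labels could then be mapped to fetch functions and handed to an ordered-fallback helper.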