paper_firehose.core.doi_utils

Unified DOI extraction utilities.

Consolidates DOI extraction logic from database.py and abstracts.py into a single, well-tested implementation.

Functions

`extract_doi_from_entry`(entry)	Extract DOI from a feed entry dictionary.
`extract_doi_from_json`(raw_json)	Extract DOI from a raw JSON string.
`find_doi_in_text`(text)	Search a text string for a DOI pattern.

paper_firehose.core.doi_utils.extract_doi_from_entry(entry)

Extract DOI from a feed entry dictionary.

Searches multiple common fields where DOIs appear in RSS/Atom feeds, including Dublin Core, PRISM, and standard RSS fields.

Parameters: entry (Dict[str, Any]) – Feed entry dictionary (from feedparser or similar)
Return type: Optional[str]
Returns: DOI string if found, None otherwise

Field priority order:

1. Direct DOI fields (doi, dc_identifier, prism:doi, etc.)
2. ID and link fields
3. Summary/description fields
4. Content arrays
5. Links arrays
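The priority order above can be sketched as a tiered lookup. This is a minimal illustration, not the module's actual implementation: the function name `extract_doi_sketch`, the regular expression, and the exact field names in each tier are assumptions, chosen to resemble the keys feedparser typically exposes.

```python
import re
from typing import Any, Dict, Optional

# Assumed DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
_DOI_RE = re.compile(r'10\.\d{4,9}/[^\s<>"]+')

# Assumed field names for the first three tiers, in priority order.
_SCALAR_FIELDS = (
    "doi", "dc_identifier", "prism_doi", "prism:doi",  # direct DOI fields
    "id", "link",                                      # ID and link fields
    "summary", "description",                          # summary/description
)


def _match(value: Any) -> Optional[str]:
    """Return the first DOI-like substring of value, or None."""
    if not isinstance(value, str):
        return None
    found = _DOI_RE.search(value)
    # Heuristic: trailing punctuation is usually sentence residue.
    return found.group(0).rstrip(".,;)") if found else None


def extract_doi_sketch(entry: Dict[str, Any]) -> Optional[str]:
    """Hypothetical tiered search; the first tier that yields a DOI wins."""
    for key in _SCALAR_FIELDS:
        doi = _match(entry.get(key))
        if doi:
            return doi
    # Content arrays: assumed to be a list of dicts with a "value" key.
    for item in entry.get("content") or []:
        doi = _match(item.get("value")) if isinstance(item, dict) else None
        if doi:
            return doi
    # Links arrays: assumed to be a list of dicts with an "href" key.
    for link in entry.get("links") or []:
        doi = _match(link.get("href")) if isinstance(link, dict) else None
        if doi:
            return doi
    return None
```

Because the tiers are checked in order, a DOI in a dedicated field takes precedence over one that merely appears in the summary text or in a link.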

paper_firehose.core.doi_utils.extract_doi_from_json(raw_json)

Extract DOI from a raw JSON string.

Useful when dealing with stored feed entry JSON payloads.

Parameters: raw_json (Optional[str]) – JSON string containing feed entry data
Return type: Optional[str]
Returns: DOI string if found, None otherwise
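A plausible shape for this kind of helper is to decode the payload defensively and then inspect the resulting dictionary. The sketch below is an assumption about that shape, not the module's code: `extract_doi_from_json_sketch`, the regular expression, and the short list of fields it checks are all hypothetical.

```python
import json
import re
from typing import Optional

# Assumed DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
_DOI_RE = re.compile(r'10\.\d{4,9}/[^\s<>"]+')


def extract_doi_from_json_sketch(raw_json: Optional[str]) -> Optional[str]:
    """Hypothetical helper: decode a stored entry and look for a DOI."""
    if not raw_json:
        return None
    try:
        entry = json.loads(raw_json)
    except (TypeError, ValueError):
        # json.JSONDecodeError subclasses ValueError; malformed input yields None.
        return None
    if not isinstance(entry, dict):
        return None
    # Simplification: only a few assumed string fields are checked here.
    for key in ("doi", "id", "link", "summary"):
        value = entry.get(key)
        if isinstance(value, str):
            found = _DOI_RE.search(value)
            if found:
                return found.group(0)
    return None
```

Returning None for empty, malformed, or non-object payloads keeps the helper consistent with the documented Optional[str] return type.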

paper_firehose.core.doi_utils.find_doi_in_text(text)

Search a text string for a DOI pattern.

Strips common prefixes like ‘doi:’ before searching.

Parameters: text (Optional[str]) – Text to search for DOI
Return type: Optional[str]
Returns: DOI string if found, None otherwise

Examples

>>> find_doi_in_text("doi:10.1234/example")
'10.1234/example'
>>> find_doi_in_text("https://doi.org/10.1234/example")
'10.1234/example'
>>> find_doi_in_text("no doi here") is None
True
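For readers who want a feel for how such a search might work, here is a minimal sketch. It is not the library's implementation: the name `find_doi_sketch`, the regular expression, and the trailing-punctuation cleanup are assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
_DOI_RE = re.compile(r'10\.\d{4,9}/[^\s<>"]+')


def find_doi_sketch(text: Optional[str]) -> Optional[str]:
    """Hypothetical search: strip a "doi:" label, then match a DOI pattern."""
    if not text:
        return None
    cleaned = text.strip()
    # Drop a leading "doi:" label (case-insensitive) before matching.
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:].strip()
    found = _DOI_RE.search(cleaned)
    if found is None:
        return None
    # Heuristic: trailing punctuation is usually sentence residue.
    return found.group(0).rstrip(".,;)")


print(find_doi_sketch("doi:10.1234/example"))
print(find_doi_sketch("https://doi.org/10.1234/example"))
print(find_doi_sketch("no doi here"))
```

Using a search instead of a full-string match is what lets the same function handle bare DOIs, prefixed DOIs, and DOIs embedded in a resolver URL.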