paper_firehose.core.doi_utils

Unified DOI extraction utilities.

Consolidates DOI extraction logic from database.py and abstracts.py into a single, well-tested implementation.

Functions

extract_doi_from_entry(entry)    Extract DOI from a feed entry dictionary.
extract_doi_from_json(raw_json)  Extract DOI from a raw JSON string.
find_doi_in_text(text)           Search a text string for a DOI pattern.

paper_firehose.core.doi_utils.extract_doi_from_entry(entry)

Extract DOI from a feed entry dictionary.

Searches multiple common fields where DOIs appear in RSS/Atom feeds, including Dublin Core, PRISM, and standard RSS fields.

Parameters:

entry (Dict[str, Any]) – Feed entry dictionary (from feedparser or similar)

Return type:

Optional[str]

Returns:

DOI string if found, None otherwise

Field priority order:

  1. Direct DOI fields (doi, dc_identifier, prism:doi, etc.)
  2. ID and link fields
  3. Summary/description fields
  4. Content arrays
  5. Links arrays
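
The priority order above can be sketched as a cascade of lookups that returns on the first match. This is only an illustration under stated assumptions, not the module's actual implementation: the function name sketch_extract_doi, the regular expression, and the exact key names (for example prism_doi) are hypothetical. The "value" and "href" keys follow the shape feedparser uses for content and links items.

```python
import re
from typing import Any, Dict, Optional

# Hypothetical pattern and key names, chosen for illustration only.
_DOI_RE = re.compile(r"10\.\d{4,9}/[^\s<>]+")
_DIRECT_FIELDS = ("doi", "dc_identifier", "prism_doi")
_ID_FIELDS = ("id", "link")
_TEXT_FIELDS = ("summary", "description")


def _match(value: Any) -> Optional[str]:
    """Return the first DOI-like substring of value, or None."""
    if not isinstance(value, str):
        return None
    found = _DOI_RE.search(value)
    # Drop trailing punctuation that often follows a DOI in running text.
    return found.group(0).rstrip(".,;)") if found else None


def sketch_extract_doi(entry: Dict[str, Any]) -> Optional[str]:
    """Check candidate fields from most to least specific."""
    # Steps 1-3: scalar fields, visited in priority order.
    for key in _DIRECT_FIELDS + _ID_FIELDS + _TEXT_FIELDS:
        doi = _match(entry.get(key))
        if doi:
            return doi
    # Step 4: content arrays (dicts carrying a "value" key).
    for item in entry.get("content") or []:
        doi = _match(item.get("value")) if isinstance(item, dict) else None
        if doi:
            return doi
    # Step 5: links arrays (dicts carrying an "href" key).
    for item in entry.get("links") or []:
        doi = _match(item.get("href")) if isinstance(item, dict) else None
        if doi:
            return doi
    return None
```

Returning on the first hit means a DOI in a dedicated field always wins over one that merely appears in a summary, which may refer to a different paper.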

paper_firehose.core.doi_utils.extract_doi_from_json(raw_json)

Extract DOI from a raw JSON string.

Useful when dealing with stored feed entry JSON payloads.

Parameters:

raw_json (Optional[str]) – JSON string containing feed entry data

Return type:

Optional[str]

Returns:

DOI string if found, None otherwise
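
One plausible shape for this kind of helper is to decode the payload defensively and then look for a DOI in the resulting dictionary. The sketch below assumes that behaviour; the name sketch_doi_from_json, the pattern, and the short list of keys it inspects are hypothetical stand-ins, and the real function may delegate to a fuller entry lookup instead.

```python
import json
import re
from typing import Optional

# Hypothetical DOI pattern, for illustration only.
_DOI_RE = re.compile(r"10\.\d{4,9}/[^\s<>]+")


def sketch_doi_from_json(raw_json: Optional[str]) -> Optional[str]:
    """Decode a stored entry payload and look for a DOI in a few fields."""
    if not raw_json:
        return None
    try:
        payload = json.loads(raw_json)
    except (TypeError, ValueError):
        # Malformed JSON is treated the same as "no DOI found".
        return None
    if not isinstance(payload, dict):
        return None
    # Stand-in for a full entry lookup: inspect a few string fields.
    for key in ("doi", "id", "link", "summary"):
        value = payload.get(key)
        if isinstance(value, str):
            found = _DOI_RE.search(value)
            if found:
                return found.group(0)
    return None
```

Treating None, empty strings, and undecodable input alike keeps callers free of try/except blocks when scanning stored rows.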

paper_firehose.core.doi_utils.find_doi_in_text(text)

Search a text string for a DOI pattern.

Strips common prefixes like ‘doi:’ before searching.

Parameters:

text (Optional[str]) – Text to search for DOI

Return type:

Optional[str]

Returns:

DOI string if found, None otherwise

Examples

>>> find_doi_in_text("doi:10.1234/example")
'10.1234/example'
>>> find_doi_in_text("https://doi.org/10.1234/example")
'10.1234/example'
>>> print(find_doi_in_text("no doi here"))
None
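
A minimal sketch of prefix stripping followed by a pattern search, assuming a regular expression of the common "10.<registrant>/<suffix>" form. The function name sketch_find_doi and both patterns are hypothetical; the module's actual pattern and prefix list may differ.

```python
import re
from typing import Optional

# Hypothetical patterns, for illustration only.
_PREFIX_RE = re.compile(
    r"^\s*(?:doi:\s*|https?://(?:dx\.)?doi\.org/)", re.IGNORECASE
)
_DOI_RE = re.compile(r"10\.\d{4,9}/[^\s<>]+")


def sketch_find_doi(text: Optional[str]) -> Optional[str]:
    """Strip a leading 'doi:' or resolver URL, then search for a DOI."""
    if not text:
        return None
    stripped = _PREFIX_RE.sub("", text.strip())
    found = _DOI_RE.search(stripped)
    if not found:
        return None
    # Drop sentence punctuation that may trail the DOI.
    return found.group(0).rstrip(".,;")


print(sketch_find_doi("doi:10.1234/example"))
print(sketch_find_doi("See 10.1234/example."))
print(sketch_find_doi("no doi here"))
```

Under these assumptions the three calls print 10.1234/example, 10.1234/example, and None.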