paper_firehose.core.doi_utils
Unified DOI extraction utilities.
Consolidates DOI extraction logic from database.py and abstracts.py into a single, well-tested implementation.
Functions
| extract_doi_from_entry(entry) | Extract DOI from a feed entry dictionary. |
| extract_doi_from_json(raw_json) | Extract DOI from a raw JSON string. |
| find_doi_in_text(text) | Search a text string for a DOI pattern. |
- paper_firehose.core.doi_utils.extract_doi_from_entry(entry)
Extract DOI from a feed entry dictionary.
Searches multiple common fields where DOIs appear in RSS/Atom feeds, including Dublin Core, PRISM, and standard RSS fields.
- Parameters:
entry (Dict[str, Any]) – Feed entry dictionary (from feedparser or similar)
- Return type:
Optional[str]
- Returns:
DOI string if found, None otherwise
- Field priority order:
1. Direct DOI fields (doi, dc_identifier, prism:doi, etc.)
2. ID and link fields
3. Summary/description fields
4. Content arrays
5. Links arrays
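The priority order above can be illustrated with a minimal sketch. This is not the library's implementation: the function name `sketch_extract_doi`, the regular expression, and the exact field names checked are all assumptions chosen for illustration. It only shows how a first-match-wins lookup over progressively less specific fields might work.

```python
import re
from typing import Any, Dict, Optional

# Assumed pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s<>]+")


def _first_doi(value: Any) -> Optional[str]:
    """Return the first DOI-like substring if value is a string, else None."""
    if not isinstance(value, str):
        return None
    match = DOI_RE.search(value)
    # Trailing sentence punctuation is not treated as part of the DOI.
    return match.group(0).rstrip(".,;)") if match else None


def sketch_extract_doi(entry: Dict[str, Any]) -> Optional[str]:
    """Hypothetical priority-ordered lookup; field names are illustrative."""
    # Steps 1-3: scalar fields, most specific first.
    scalar_fields = (
        "doi", "dc_identifier", "prism_doi",  # direct DOI fields
        "id", "link",                         # ID and link fields
        "summary", "description",             # summary/description fields
    )
    for field in scalar_fields:
        doi = _first_doi(entry.get(field))
        if doi:
            return doi
    # Step 4: content arrays, e.g. [{"value": "..."}].
    for item in entry.get("content") or []:
        doi = _first_doi(item.get("value") if isinstance(item, dict) else item)
        if doi:
            return doi
    # Step 5: links arrays, e.g. [{"href": "..."}].
    for item in entry.get("links") or []:
        doi = _first_doi(item.get("href") if isinstance(item, dict) else item)
        if doi:
            return doi
    return None
```

Because the first match wins, a DOI in a dedicated field takes precedence over one that merely appears in a summary or link, which is presumably the point of ordering the fields this way.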
- paper_firehose.core.doi_utils.extract_doi_from_json(raw_json)
Extract DOI from a raw JSON string.
Useful when dealing with stored feed entry JSON payloads.
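As a rough illustration of working with stored JSON payloads, the sketch below parses the string defensively and scans every string value for a DOI-like pattern. The name `sketch_doi_from_json`, the regular expression, and the recursive walk are assumptions made for this example. How the library function actually processes the parsed payload is not documented in this section.

```python
import json
import re
from typing import Any, Optional

# Assumed DOI pattern, for illustration only.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s<>]+")


def _walk(value: Any) -> Optional[str]:
    """Depth-first search of parsed JSON for the first DOI-like string."""
    if isinstance(value, str):
        match = DOI_RE.search(value)
        return match.group(0).rstrip(".,;)") if match else None
    if isinstance(value, dict):
        value = list(value.values())
    if isinstance(value, list):
        for item in value:
            found = _walk(item)
            if found:
                return found
    return None


def sketch_doi_from_json(raw_json: Optional[str]) -> Optional[str]:
    """Hypothetical helper: return None for empty or malformed JSON."""
    if not raw_json:
        return None
    try:
        payload = json.loads(raw_json)
    except (TypeError, ValueError):  # json.JSONDecodeError is a ValueError
        return None
    return _walk(payload)
```

Returning None on malformed input, rather than raising, keeps the helper safe to call on every stored row, including rows with empty or corrupted payloads.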
- paper_firehose.core.doi_utils.find_doi_in_text(text)
Search a text string for a DOI pattern.
Strips common prefixes like ‘doi:’ before searching.
- Parameters:
text (str) – Text string to search for a DOI
- Return type:
Optional[str]
- Returns:
DOI string if found, None otherwise
Examples
>>> find_doi_in_text("doi:10.1234/example")
'10.1234/example'
>>> find_doi_in_text("https://doi.org/10.1234/example")
'10.1234/example'
>>> find_doi_in_text("no doi here") is None
True
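The behavior shown in these examples can be approximated with a short sketch. It is not the library's code: the name `sketch_find_doi`, the prefix list, and the regular expression are assumptions. It strips a known prefix first and then searches the remainder for a DOI-like pattern.

```python
import re
from typing import Optional

# Assumed DOI pattern and prefix list, for illustration only.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s<>]+")
PREFIXES = ("doi:", "https://doi.org/", "http://dx.doi.org/")


def sketch_find_doi(text: Optional[str]) -> Optional[str]:
    """Hypothetical helper: strip a known prefix, then search for a DOI."""
    if not text:
        return None
    candidate = text.strip()
    lowered = candidate.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix):
            # Drop the prefix, matched case-insensitively.
            candidate = candidate[len(prefix):].strip()
            break
    match = DOI_RE.search(candidate)
    # Trailing sentence punctuation is not treated as part of the DOI.
    return match.group(0).rstrip(".,;)") if match else None
```

With this pattern the prefix stripping is not strictly required, since the regex search would find the DOI inside the prefixed string anyway. It is kept to mirror the documented behavior and would matter if the pattern were anchored to the start of the string.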