paper_firehose.core.text_utils

Shared text processing utilities.

Consolidates text normalization, cleaning, and matching functions used across the codebase for author names, abstracts, and other text fields.

Functions

`clean_abstract_for_db`(text)	Conservative sanitizer for abstracts before storing in the database.
`names_match`(a, b)	Heuristic author-name comparator supporting initials and comma forms.
`normalize_name`(text)	Normalize a human name for loose matching.
`parse_name_parts`(name)	Parse a human name into (lastname, initials[]).
`strip_accents`(text)	Return ASCII-ish text by removing accent marks via Unicode normalization.
`strip_jats`(text)	Remove JATS/HTML tags and unescape entities in Crossref-style strings.

paper_firehose.core.text_utils.clean_abstract_for_db(text)

Conservative sanitizer for abstracts before storing in the database.

Performs comprehensive cleaning:

- Removes JATS/HTML tags and unescapes entities via strip_jats()
- Strips stray ‘<’ and ‘>’ characters (common artifact from feeds)
- Removes leading feed prefixes like “Abstract” and arXiv announce headers
- Normalizes whitespace and removes zero-width characters

Parameters: text (Optional[str]) – Raw abstract text from API or feed
Return type: Optional[str]
Returns: Cleaned abstract ready for database storage, or None if input was None

Examples

>>> clean_abstract_for_db("Abstract: This is the abstract.")
'This is the abstract.'
>>> clean_abstract_for_db("arXiv:2509.09390v1 Announce Type: new Abstract: Text")
'Text'
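
The cleaning steps listed above can be approximated with the standard library. This is a minimal sketch, not the module's actual implementation: the name `clean_abstract_sketch`, the regular expressions, and the order of the steps are all assumptions chosen to reproduce the documented examples.

```python
import html
import re
from typing import Optional

# All patterns below are illustrative assumptions, not the library's own.
_TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")               # things that look like tags
_ZERO_WIDTH_RE = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
_ARXIV_HEADER_RE = re.compile(
    r"^\s*arXiv:\S+\s+Announce Type:\s*\S+\s*", re.IGNORECASE
)
_ABSTRACT_PREFIX_RE = re.compile(r"^\s*Abstract\s*:\s*", re.IGNORECASE)


def clean_abstract_sketch(text: Optional[str]) -> Optional[str]:
    """Hypothetical stand-in for clean_abstract_for_db()."""
    if text is None:
        return None
    # 1. Drop tags first, then unescape entities.
    cleaned = html.unescape(_TAG_RE.sub("", text))
    # 2. Remove any angle brackets that are still present.
    cleaned = cleaned.replace("<", "").replace(">", "")
    # 3. Remove zero-width characters before looking at prefixes.
    cleaned = _ZERO_WIDTH_RE.sub("", cleaned)
    # 4. Strip the arXiv announce header, then a leading "Abstract:" label.
    cleaned = _ARXIV_HEADER_RE.sub("", cleaned)
    cleaned = _ABSTRACT_PREFIX_RE.sub("", cleaned)
    # 5. Collapse runs of whitespace and trim the ends.
    return " ".join(cleaned.split())


print(clean_abstract_sketch("Abstract: This is the abstract."))
print(clean_abstract_sketch("arXiv:2509.09390v1 Announce Type: new Abstract: Text"))
```

The prefix pattern requires a colon after "Abstract" so that an abstract beginning with a word such as "Abstraction" is left intact. The real function may be more or less permissive.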

paper_firehose.core.text_utils.names_match(a, b)

Heuristic author-name comparator supporting initials and comma forms.

Compares two author names with fuzzy matching that handles:

- Different name orderings (Last, First vs First Last)
- Initials vs full first names
- Accents and punctuation differences

Parameters:
    a (str) – First author name
    b (str) – Second author name
Return type: bool
Returns: True if the names likely refer to the same person, False otherwise

Examples

>>> names_match("Smith, J. P.", "John P. Smith")
True
>>> names_match("J. Smith", "Jane Smith")
True
>>> names_match("J. Smith", "John Doe")
False
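
One way to get the behaviour shown in these examples is to compare last names exactly and initials position by position. The sketch below assumes that rule. `names_match_sketch`, `_fold`, and `_parts` are hypothetical names, and the real heuristic may weigh things differently.

```python
import re
import unicodedata
from typing import List, Tuple


def _fold(text: str) -> str:
    """Hypothetical helper: drop accents and punctuation, lowercase, squeeze spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(re.sub(r"[^\w\s]", " ", no_accents).lower().split())


def _parts(name: str) -> Tuple[str, List[str]]:
    """Hypothetical helper: split a name into (lastname, initials)."""
    if "," in name:
        # "Last, First M" style: everything before the comma is the last name.
        last_raw, _, given_raw = name.partition(",")
        last, given = _fold(last_raw), _fold(given_raw).split()
    else:
        # "First M Last" style: assume the final token is the last name.
        tokens = _fold(name).split()
        last, given = (tokens[-1], tokens[:-1]) if tokens else ("", [])
    return last, [tok[0] for tok in given]


def names_match_sketch(a: str, b: str) -> bool:
    """Assumed rule: same last name, and initials agree wherever both have one."""
    last_a, initials_a = _parts(a)
    last_b, initials_b = _parts(b)
    if not last_a or last_a != last_b:
        return False
    # zip() stops at the shorter list, so "J. Smith" stays compatible
    # with "John P. Smith".
    return all(x == y for x, y in zip(initials_a, initials_b))


print(names_match_sketch("Smith, J. P.", "John P. Smith"))  # True
print(names_match_sketch("J. Smith", "John Doe"))           # False
```

Because only initials are compared, "J. Smith" matches both "Jane Smith" and "John Smith". That is consistent with the documented example and is the usual trade-off of a recall-oriented matcher.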

paper_firehose.core.text_utils.normalize_name(text)

Normalize a human name for loose matching.

Strips accents, punctuation, and converts to lowercase for fuzzy name matching.

Parameters: text (str) – Human name to normalize
Return type: str
Returns: Normalized name suitable for comparison

Examples

>>> normalize_name("García-López, José")
'garcia lopez jose'
>>> normalize_name("John P. Smith")
'john p smith'
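
The normalization described above can be sketched in three steps: decompose and drop accents, replace punctuation with spaces, then lowercase and collapse whitespace. `normalize_name_sketch` is a hypothetical name, and treating punctuation as a separator rather than deleting it is an assumption inferred from the 'garcia lopez jose' example.

```python
import re
import unicodedata


def normalize_name_sketch(text: str) -> str:
    """Hypothetical stand-in for normalize_name()."""
    # Split accented characters into base character + combining marks,
    # then keep only the base characters.
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation with a space so "García-López" becomes two tokens.
    no_punct = re.sub(r"[^\w\s]", " ", no_accents)
    # Lowercase and collapse any run of whitespace into a single space.
    return " ".join(no_punct.lower().split())


print(normalize_name_sketch("García-López, José"))  # garcia lopez jose
print(normalize_name_sketch("John P. Smith"))       # john p smith
```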

paper_firehose.core.text_utils.parse_name_parts(name)

Parse a human name into (lastname, initials[]).

Handles both “Last, First M” and “First M Last” styles, ignoring accents and case for robust parsing.

Parameters: name (str) – Full name in various formats
Return type: Tuple[str, List[str]]
Returns: Tuple of (lastname, list of first/middle initials)

Examples

>>> parse_name_parts("Smith, John P.")
('smith', ['j', 'p'])
>>> parse_name_parts("John P. Smith")
('smith', ['j', 'p'])
>>> parse_name_parts("García-López, José")
('garcia lopez', ['j'])
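
The two name styles can be told apart by the presence of a comma. The sketch below assumes that rule and, for the comma-free style, assumes the final token is the last name. `parse_name_parts_sketch` and `_fold` are hypothetical names, not the module's API.

```python
import re
import unicodedata
from typing import List, Tuple


def _fold(text: str) -> str:
    """Hypothetical helper: drop accents and punctuation, lowercase, squeeze spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(re.sub(r"[^\w\s]", " ", no_accents).lower().split())


def parse_name_parts_sketch(name: str) -> Tuple[str, List[str]]:
    """Hypothetical stand-in for parse_name_parts()."""
    if "," in name:
        # "Last, First M": the part before the first comma is the last name,
        # which may itself contain several tokens ("garcia lopez").
        last_raw, _, given_raw = name.partition(",")
        last = _fold(last_raw)
        given = _fold(given_raw).split()
    else:
        # "First M Last": assume the final token is the last name.
        tokens = _fold(name).split()
        if not tokens:
            return "", []
        last, given = tokens[-1], tokens[:-1]
    # Reduce every given-name token to its first letter.
    return last, [tok[0] for tok in given]


print(parse_name_parts_sketch("Smith, John P."))      # ('smith', ['j', 'p'])
print(parse_name_parts_sketch("García-López, José"))  # ('garcia lopez', ['j'])
```

Under these assumptions a compound last name is only recovered intact in the comma form. Written as "José García-López", the hyphen becomes a space and only "lopez" would be taken as the last name.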

paper_firehose.core.text_utils.strip_accents(text)

Return ASCII-ish text by removing accent marks via Unicode normalization.

Useful for comparing author names and other text where accents should not affect matching.

Parameters: text (str) – Text potentially containing accented characters
Return type: str
Returns: Text with accent marks removed

Examples

>>> strip_accents("José García")
'Jose Garcia'
>>> strip_accents("Müller")
'Muller'
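
Accent removal via Unicode normalization usually means decomposing each character and discarding the combining marks. The sketch below assumes the NFKD form. `strip_accents_sketch` is a hypothetical name, and the module may use a different normalization form.

```python
import unicodedata


def strip_accents_sketch(text: str) -> str:
    """Hypothetical stand-in for strip_accents()."""
    # NFKD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns a non-zero class for combining marks,
    # so keeping only the zero-class characters drops the accents.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents_sketch("José García"))  # Jose Garcia
print(strip_accents_sketch("Müller"))       # Muller
```

This approach explains the "ASCII-ish" wording: letters such as "ø" have no Unicode decomposition, so they pass through unchanged rather than becoming "o".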

paper_firehose.core.text_utils.strip_jats(text)

Remove JATS/HTML tags and unescape entities in Crossref-style strings.

JATS (Journal Article Tag Suite) is an XML format used by publishers. Crossref and other APIs often return abstracts with JATS tags embedded.

Parameters: text (Optional[str]) – Text potentially containing JATS/HTML tags
Return type: Optional[str]
Returns: Cleaned text with tags removed and entities unescaped, or None if input was None

Examples

>>> strip_jats("<jats:p>Some text</jats:p>")
'Some text'
>>> strip_jats("Text with &lt;angle&gt; brackets")
'Text with <angle> brackets'
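
Both documented behaviours can be reproduced with a regular expression and `html.unescape`. The following is a minimal sketch under assumptions: `strip_jats_sketch` and its tag pattern are hypothetical, and the real function may use a proper parser instead.

```python
import html
import re
from typing import Optional

# Assumed pattern: "<" or "</" followed by a letter, up to the next ">".
_TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def strip_jats_sketch(text: Optional[str]) -> Optional[str]:
    """Hypothetical stand-in for strip_jats()."""
    if text is None:
        return None
    # Remove tags BEFORE unescaping entities. In the other order,
    # "&lt;angle&gt;" would first become "<angle>" and then be deleted as a tag.
    without_tags = _TAG_RE.sub("", text)
    return html.unescape(without_tags).strip()


print(strip_jats_sketch("<jats:p>Some text</jats:p>"))        # Some text
print(strip_jats_sketch("Text with &lt;angle&gt; brackets"))  # Text with <angle> brackets
```

Requiring a letter right after the opening bracket keeps inequalities such as "a < b" from being mistaken for the start of a tag.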