paper_firehose.core.text_utils

Shared text processing utilities.

Consolidates text normalization, cleaning, and matching functions used across the codebase for author names, abstracts, and other text fields.

Functions

clean_abstract_for_db(text)

Conservative sanitizer for abstracts before storing in database.

names_match(a, b)

Heuristic author-name comparator supporting initials and comma forms.

normalize_name(text)

Normalize a human name for loose matching.

parse_name_parts(name)

Parse a human name into (lastname, initials[]).

strip_accents(text)

Return ASCII-ish text by removing accent marks via Unicode normalization.

strip_jats(text)

Remove JATS/HTML tags and unescape entities in Crossref-style strings.

paper_firehose.core.text_utils.clean_abstract_for_db(text)

Conservative sanitizer for abstracts before storing in database.

Performs comprehensive cleaning:
  • Removes JATS/HTML tags and unescapes entities via strip_jats()

  • Strips stray ‘<’ and ‘>’ characters (common artifact from feeds)

  • Removes leading feed prefixes like “Abstract” and arXiv announce headers

  • Normalizes whitespace and removes zero-width characters

Parameters:

text (Optional[str]) – Raw abstract text from API or feed

Return type:

Optional[str]

Returns:

Cleaned abstract ready for database storage, or None if input was None

Examples

>>> clean_abstract_for_db("Abstract: This is the abstract.")
'This is the abstract.'
>>> clean_abstract_for_db("arXiv:2509.09390v1 Announce Type: new Abstract: Text")
'Text'
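The cleaning steps listed for clean_abstract_for_db() can be approximated with the standard library alone. The following is a minimal sketch, not the package's implementation: the function name clean_abstract_sketch, the regular expressions, the set of zero-width characters, and the order of the steps are all assumptions made for illustration.

```python
import html
import re
from typing import Optional

# All names below are hypothetical; they are not taken from paper_firehose.
_TAG_RE = re.compile(r"<[^>]+>")  # naive match for JATS/HTML tags
_ZERO_WIDTH_RE = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")
# Assumed shape of an arXiv feed header: "arXiv:<id> Announce Type: <type>"
_ARXIV_HEADER_RE = re.compile(
    r"^arXiv:\S+\s+Announce Type:\s*\S+\s*", re.IGNORECASE
)
# Require a colon or dash so a sentence starting with "Abstract algebra" is kept
_ABSTRACT_PREFIX_RE = re.compile(r"^abstract\s*[:\-]\s*", re.IGNORECASE)


def clean_abstract_sketch(text: Optional[str]) -> Optional[str]:
    if text is None:
        return None
    # 1. Drop tags first, then unescape entities
    cleaned = html.unescape(_TAG_RE.sub("", text))
    # 2. Strip any stray angle brackets that remain
    cleaned = cleaned.replace("<", "").replace(">", "")
    # 3. Remove zero-width characters and collapse whitespace runs
    cleaned = _ZERO_WIDTH_RE.sub("", cleaned)
    cleaned = " ".join(cleaned.split())
    # 4. Remove a leading arXiv header, then a leading "Abstract:" prefix
    cleaned = _ARXIV_HEADER_RE.sub("", cleaned)
    cleaned = _ABSTRACT_PREFIX_RE.sub("", cleaned)
    return cleaned.strip()
```

Under these assumptions the sketch reproduces both documented examples. Requiring a delimiter after "Abstract" is a deliberately cautious choice; the real function may accept other prefix forms.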
paper_firehose.core.text_utils.names_match(a, b)

Heuristic author-name comparator supporting initials and comma forms.

Compares two author names with fuzzy matching that handles:
  • Different name orderings (Last, First vs First Last)

  • Initials vs full first names

  • Accents and punctuation differences

Parameters:
  • a (str) – First author name

  • b (str) – Second author name

Return type:

bool

Returns:

True if names likely refer to the same person, False otherwise

Examples

>>> names_match("Smith, J. P.", "John P. Smith")
True
>>> names_match("J. Smith", "Jane Smith")
True
>>> names_match("J. Smith", "John Doe")
False
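One way such a comparator could work is to reduce each name to a surname plus a list of initials and compare those. The sketch below is an assumption about the heuristic, not the package's code: the helper names (_normalize, _parts, names_match_sketch) are hypothetical, and the rule "surnames must be equal and initials must agree over the shorter list" is chosen only because it is consistent with the three documented examples.

```python
import re
import unicodedata
from typing import List, Tuple


def _normalize(text: str) -> str:
    # Hypothetical helper: strip accents and punctuation, lowercase, collapse spaces
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(re.sub(r"[^\w\s]", " ", no_accents).lower().split())


def _parts(name: str) -> Tuple[str, List[str]]:
    # Hypothetical helper: split a name into (surname, initials)
    if "," in name:
        last_raw, _, given_raw = name.partition(",")
        last, given = _normalize(last_raw), _normalize(given_raw).split()
    else:
        tokens = _normalize(name).split()
        if not tokens:
            return "", []
        last, given = tokens[-1], tokens[:-1]
    return last, [tok[0] for tok in given]


def names_match_sketch(a: str, b: str) -> bool:
    last_a, initials_a = _parts(a)
    last_b, initials_b = _parts(b)
    # Surnames must be present and identical after normalization
    if not last_a or last_a != last_b:
        return False
    # Initials are compared pairwise over the shorter list, so "J." is
    # compatible with "John P."; an empty list matches any initials
    return all(x == y for x, y in zip(initials_a, initials_b))
```

A rule this loose will report "J. Smith" and "Jane Smith" as a match, as in the documented example, so it is best read as "possibly the same person" rather than as proof of identity.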
paper_firehose.core.text_utils.normalize_name(text)

Normalize a human name for loose matching.

Strips accents, punctuation, and converts to lowercase for fuzzy name matching.

Parameters:

text (str) – Human name to normalize

Return type:

str

Returns:

Normalized name suitable for comparison

Examples

>>> normalize_name("García-López, José")
'garcia lopez jose'
>>> normalize_name("John P. Smith")
'john p smith'
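The normalization described above (strip accents, drop punctuation, lowercase) can be sketched with unicodedata and re. This is an illustrative approximation rather than the package's code; the name normalize_name_sketch is hypothetical, and replacing punctuation with spaces (instead of deleting it) is an assumption inferred from the hyphenated example.

```python
import re
import unicodedata


def normalize_name_sketch(text: str) -> str:
    # Decompose accented characters so each accent becomes a combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace every character that is neither a word character nor
    # whitespace with a space, so "Garcia-Lopez" splits into two tokens
    no_punct = re.sub(r"[^\w\s]", " ", no_accents)
    # Lowercase and collapse runs of whitespace into single spaces
    return " ".join(no_punct.lower().split())
```

With these choices the sketch returns the same strings as the two documented examples.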
paper_firehose.core.text_utils.parse_name_parts(name)

Parse a human name into (lastname, initials[]).

Handles both “Last, First M” and “First M Last” styles, ignoring accents and case for robust parsing.

Parameters:

name (str) – Full name in various formats

Return type:

Tuple[str, List[str]]

Returns:

Tuple of (lastname, list of first/middle initials)

Examples

>>> parse_name_parts("Smith, John P.")
('smith', ['j', 'p'])
>>> parse_name_parts("John P. Smith")
('smith', ['j', 'p'])
>>> parse_name_parts("García-López, José")
('garcia lopez', ['j'])
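The two supported styles can be told apart by the presence of a comma. The sketch below illustrates that idea under stated assumptions: the names parse_name_parts_sketch and _normalize are hypothetical, and treating the final token as the surname in the "First M Last" style is a simplification that the real function may refine.

```python
import re
import unicodedata
from typing import List, Tuple


def _normalize(text: str) -> str:
    # Hypothetical helper: strip accents and punctuation, lowercase, collapse spaces
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(re.sub(r"[^\w\s]", " ", no_accents).lower().split())


def parse_name_parts_sketch(name: str) -> Tuple[str, List[str]]:
    if "," in name:
        # "Last, First M": everything before the first comma is the surname
        last_raw, _, given_raw = name.partition(",")
        last = _normalize(last_raw)
        given = _normalize(given_raw).split()
    else:
        # "First M Last": assume the final token is the surname
        tokens = _normalize(name).split()
        if not tokens:
            return "", []
        last, given = tokens[-1], tokens[:-1]
    # Keep only the first letter of each given-name token
    return last, [tok[0] for tok in given]
```

Note that in the comma-free style this sketch cannot recover a multi-word surname such as "van der Berg"; only the comma form carries that information explicitly.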
paper_firehose.core.text_utils.strip_accents(text)

Return ASCII-ish text by removing accent marks via Unicode normalization.

Useful for comparing author names and other text where accents should not affect matching.

Parameters:

text (str) – Text potentially containing accented characters

Return type:

str

Returns:

Text with accent marks removed

Examples

>>> strip_accents("José García")
'Jose Garcia'
>>> strip_accents("Müller")
'Muller'
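Removing accents "via Unicode normalization" usually means decomposing each character and discarding the combining marks. A minimal sketch of that technique follows; the name strip_accents_sketch is hypothetical, and the choice of the NFKD form is an assumption (NFD would behave the same for these examples).

```python
import unicodedata


def strip_accents_sketch(text: str) -> str:
    # NFKD splits "é" into "e" followed by a combining acute accent
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep every other character unchanged
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Letters that have no Unicode decomposition, such as "ø" or "ß", pass through this approach unchanged, which is one reason the result is only "ASCII-ish".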
paper_firehose.core.text_utils.strip_jats(text)

Remove JATS/HTML tags and unescape entities in Crossref-style strings.

JATS (Journal Article Tag Suite) is an XML format used by publishers. Crossref and other APIs often return abstracts with JATS tags embedded.

Parameters:

text (Optional[str]) – Text potentially containing JATS/HTML tags

Return type:

Optional[str]

Returns:

Cleaned text with tags removed and entities unescaped, or None if input was None

Examples

>>> strip_jats("<jats:p>Some text</jats:p>")
'Some text'
>>> strip_jats("Text with &lt;angle&gt; brackets")
'Text with <angle> brackets'
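Tag removal and entity unescaping of this kind can be sketched with a regular expression and html.unescape from the standard library. The name strip_jats_sketch and the tag pattern are assumptions for illustration; the package may use a different pattern or a real parser.

```python
import html
import re
from typing import Optional

# Naive pattern (an assumption): anything between "<" and the next ">"
_TAG_RE = re.compile(r"<[^>]+>")


def strip_jats_sketch(text: Optional[str]) -> Optional[str]:
    if text is None:
        return None
    # Remove tags before unescaping, so that "&lt;angle&gt;" survives
    # as literal "<angle>" instead of being mistaken for a tag
    without_tags = _TAG_RE.sub("", text)
    return html.unescape(without_tags).strip()
```

The order of the two steps matters: unescaping first would turn escaped brackets into something the tag pattern then deletes, which would contradict the second documented example.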