paper_firehose.core.text_utils
Shared text processing utilities.
Consolidates text normalization, cleaning, and matching functions used across the codebase for author names, abstracts, and other text fields.
Functions
| clean_abstract_for_db(text) | Conservative sanitizer for abstracts before storing in database. |
| names_match(a, b) | Heuristic author-name comparator supporting initials and comma forms. |
| normalize_name(text) | Normalize a human name for loose matching. |
| parse_name_parts(name) | Parse a human name into (lastname, initials[]). |
| strip_accents(text) | Return ASCII-ish text by removing accent marks via Unicode normalization. |
| strip_jats(text) | Remove JATS/HTML tags and unescape entities in Crossref-style strings. |
- paper_firehose.core.text_utils.clean_abstract_for_db(text)
Conservative sanitizer for abstracts before storing in database.
Performs comprehensive cleaning:
- Removes JATS/HTML tags and unescapes entities via strip_jats()
- Strips stray ‘<’ and ‘>’ characters (common artifact from feeds)
- Removes leading feed prefixes like “Abstract” and arXiv announce headers
- Normalizes whitespace and removes zero-width characters
- Parameters:
- Return type:
- Returns:
Cleaned abstract ready for database storage, or None if input was None
Examples
>>> clean_abstract_for_db("Abstract: This is the abstract.")
'This is the abstract.'
>>> clean_abstract_for_db("arXiv:2509.09390v1 Announce Type: new Abstract: Text")
'Text'
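The cleaning steps listed for clean_abstract_for_db can be approximated with the standard library alone. The following is a minimal sketch, not the module's actual code: the name clean_abstract_sketch and every regular expression in it are assumptions chosen to reproduce the documented examples.

```python
import html
import re
from typing import Optional

# All patterns below are illustrative assumptions, not the library's own.
_TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")
_ARXIV_HEADER_RE = re.compile(
    r"^\s*arXiv:\S+\s+Announce Type:\s*\S+\s*", re.IGNORECASE
)
# Require punctuation after "Abstract" so that a sentence such as
# "Abstract algebra is ..." keeps its first word.
_ABSTRACT_PREFIX_RE = re.compile(r"^\s*abstract\s*[:.\-]\s*", re.IGNORECASE)
_ZERO_WIDTH_RE = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")


def clean_abstract_sketch(text: Optional[str]) -> Optional[str]:
    """Hypothetical stand-in for clean_abstract_for_db."""
    if text is None:
        return None
    # 1. Drop tags, then unescape entities such as &amp;.
    cleaned = html.unescape(_TAG_RE.sub("", text))
    # 2. Replace stray angle brackets with spaces.
    cleaned = cleaned.replace("<", " ").replace(">", " ")
    # 3. Remove zero-width characters.
    cleaned = _ZERO_WIDTH_RE.sub("", cleaned)
    # 4. Remove a leading arXiv announce header, then an "Abstract:" prefix.
    cleaned = _ARXIV_HEADER_RE.sub("", cleaned)
    cleaned = _ABSTRACT_PREFIX_RE.sub("", cleaned)
    # 5. Collapse runs of whitespace and trim the ends.
    return " ".join(cleaned.split())
```

The order matters in this sketch: the arXiv header is removed before the "Abstract" prefix because the prefix only becomes the start of the string once the header is gone.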
- paper_firehose.core.text_utils.names_match(a, b)
Heuristic author-name comparator supporting initials and comma forms.
Compares two author names with fuzzy matching that handles:
- Different name orderings (Last, First vs First Last)
- Initials vs full first names
- Accents and punctuation differences
- Parameters:
- Return type:
- Returns:
True if names likely refer to the same person, False otherwise
Examples
>>> names_match("Smith, J. P.", "John P. Smith")
True
>>> names_match("J. Smith", "Jane Smith")
True
>>> names_match("J. Smith", "John Doe")
False
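One plausible comparison rule behind results like these works on the (lastname, initials) form that parse_name_parts is documented to return. The sketch below is an assumption about such a rule, not the library's implementation; parts_match_sketch and NameParts are hypothetical names.

```python
from typing import List, Tuple

NameParts = Tuple[str, List[str]]  # (lastname, initials), already normalized


def parts_match_sketch(a: NameParts, b: NameParts) -> bool:
    """Hypothetical rule: same last name and compatible leading initials."""
    last_a, initials_a = a
    last_b, initials_b = b
    if not last_a or last_a != last_b:
        return False
    # zip() stops at the shorter list, so a lone "j" is only compared
    # with the first initial of the other name.
    return all(x == y for x, y in zip(initials_a, initials_b))


print(parts_match_sketch(("smith", ["j", "p"]), ("smith", ["j", "p"])))
print(parts_match_sketch(("smith", ["j"]), ("doe", ["j"])))
```

Under this rule a name with no initials matches any name with the same last name. The real function may be stricter.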
- paper_firehose.core.text_utils.normalize_name(text)
Normalize a human name for loose matching.
Strips accents and punctuation and converts the name to lowercase for fuzzy name matching.
- Parameters:
text (str) – Human name to normalize
- Return type:
- Returns:
Normalized name suitable for comparison
Examples
>>> normalize_name("García-López, José")
'garcia lopez jose'
>>> normalize_name("John P. Smith")
'john p smith'
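A normalization of this kind can be sketched with unicodedata and re from the standard library. The function name normalize_name_sketch and the choice to turn punctuation into spaces are assumptions made to match the documented outputs; the library may differ in detail.

```python
import re
import unicodedata


def normalize_name_sketch(text: str) -> str:
    """Hypothetical stand-in for normalize_name."""
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Turn every character that is neither a word character nor
    # whitespace (hyphens, commas, periods) into a space.
    spaced = re.sub(r"[^\w\s]", " ", no_accents)
    # Lowercase and collapse repeated whitespace.
    return " ".join(spaced.lower().split())
```

Replacing punctuation with a space rather than deleting it keeps hyphenated surnames as two tokens, which is what the 'garcia lopez jose' example suggests.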
- paper_firehose.core.text_utils.parse_name_parts(name)
Parse a human name into (lastname, initials[]).
Handles both “Last, First M” and “First M Last” styles, ignoring accents and case for robust parsing.
- Parameters:
name (str) – Full name in various formats
- Return type:
- Returns:
Tuple of (lastname, list of first/middle initials)
Examples
>>> parse_name_parts("Smith, John P.")
('smith', ['j', 'p'])
>>> parse_name_parts("John P. Smith")
('smith', ['j', 'p'])
>>> parse_name_parts("García-López, José")
('garcia lopez', ['j'])
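The two name styles can be told apart by the presence of a comma. The sketch below illustrates that idea under stated assumptions: parse_name_parts_sketch and its helper _normalize are hypothetical, and the rule "last token is the surname" for comma-free names is a simplification.

```python
import re
import unicodedata
from typing import List, Tuple


def _normalize(text: str) -> str:
    # Strip accents, replace punctuation with spaces, lowercase.
    decomposed = unicodedata.normalize("NFKD", text)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(re.sub(r"[^\w\s]", " ", plain).lower().split())


def parse_name_parts_sketch(name: str) -> Tuple[str, List[str]]:
    """Hypothetical stand-in for parse_name_parts."""
    if "," in name:
        # "Last, First M": everything before the first comma is the surname.
        last, _, given = name.partition(",")
        last_norm = _normalize(last)
        given_tokens = _normalize(given).split()
    else:
        # "First M Last": assume the final token is the surname.
        tokens = _normalize(name).split()
        if not tokens:
            return "", []
        last_norm, given_tokens = tokens[-1], tokens[:-1]
    # Keep only the first letter of each given-name token.
    return last_norm, [tok[0] for tok in given_tokens]
```

With this simplification a comma-free multi-word surname such as "Ludwig van Beethoven" yields ('beethoven', ['l', 'v']); how the real function treats such names is not stated in this page.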
- paper_firehose.core.text_utils.strip_accents(text)
Return ASCII-ish text by removing accent marks via Unicode normalization.
Useful for comparing author names and other text where accents should not affect matching.
- Parameters:
text (str) – Text potentially containing accented characters
- Return type:
- Returns:
Text with accent marks removed
Examples
>>> strip_accents("José García")
'Jose Garcia'
>>> strip_accents("Müller")
'Muller'
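Accent removal "via Unicode normalization" usually means decomposing each character and discarding the combining marks. The sketch below shows that technique with the standard unicodedata module; strip_accents_sketch is a hypothetical name, and the use of NFKD (rather than NFD) is an assumption.

```python
import unicodedata


def strip_accents_sketch(text: str) -> str:
    """Hypothetical stand-in for strip_accents."""
    # NFKD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # combining() returns a non-zero class for accent marks; keep the rest.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Letters that have no decomposition, such as "ø" or "ß", pass through unchanged, which is why the result is only "ASCII-ish".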
- paper_firehose.core.text_utils.strip_jats(text)
Remove JATS/HTML tags and unescape entities in Crossref-style strings.
JATS (Journal Article Tag Suite) is an XML format used by publishers. Crossref and other APIs often return abstracts with JATS tags embedded.
- Parameters:
text (Optional[str]) – Text potentially containing JATS/HTML tags
- Return type:
- Returns:
Cleaned text with tags removed and entities unescaped, or None if input was None
Examples
>>> strip_jats("<jats:p>Some text</jats:p>")
'Some text'
>>> strip_jats("Text with &lt;angle&gt; brackets")
'Text with <angle> brackets'
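Tag removal followed by entity unescaping can be sketched with re and html.unescape. This is an illustrative approximation only: strip_jats_sketch and the tag pattern are assumptions, and a regular expression is not a full XML parser.

```python
import html
import re
from typing import Optional

# Assumed pattern: "<" or "</", a letter, then anything up to the next ">".
# This covers namespaced tags such as <jats:p> and </jats:p>.
_TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def strip_jats_sketch(text: Optional[str]) -> Optional[str]:
    """Hypothetical stand-in for strip_jats."""
    if text is None:
        return None
    # Remove tags first, so that entities such as &lt; are still escaped
    # and cannot be mistaken for markup.
    without_tags = _TAG_RE.sub("", text)
    # Then turn entities into the characters they stand for.
    return html.unescape(without_tags).strip()
```

Unescaping after tag removal is the design choice that lets escaped angle brackets survive in the output; doing it in the opposite order would delete them as if they were tags.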