paper_firehose.core.database

Database management for the three-database approach:

  • all_feed_entries.db: All RSS entries for deduplication

  • matched_entries_history.db: Historical matches across all topics

  • papers.db: Current run processing data

Classes

DatabaseManager(config)

Manages the three-database system for feed processing.

class paper_firehose.core.database.DatabaseManager(config)[source]

Bases: object

Manages the three-database system for feed processing.

Parameters:

config (Dict[str, Any])

backup_important_databases()[source]

Backup history and all_feeds databases with timestamped rotation.

  • Writes timestamped backups alongside the source DBs in the runtime data directory.

  • Keeps up to 3 most recent backups per database, pruning older ones.

Returns a dict mapping logical db keys to the created backup file paths.

Return type:

Dict[str, str]
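
The rotation described above can be illustrated with a standalone sketch. It is not the library's implementation: the helper name backup_with_rotation, the ".backup" file naming and the timestamp format are assumptions made for this example, and it handles a single file rather than returning a dict of keys.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def backup_with_rotation(db_path: Path, keep: int = 3,
                         stamp: Optional[str] = None) -> Path:
    """Copy db_path to a timestamped sibling file, then prune old backups.

    Hypothetical helper: the naming scheme is an assumption.
    """
    stamp = stamp or time.strftime("%Y%m%d_%H%M%S")
    backup = db_path.with_name(f"{db_path.stem}.{stamp}.backup{db_path.suffix}")
    shutil.copy2(db_path, backup)

    # Timestamps in this format sort lexicographically, so the newest sort last.
    pattern = f"{db_path.stem}.*.backup{db_path.suffix}"
    backups = sorted(db_path.parent.glob(pattern))
    for old in backups[:-keep]:
        old.unlink()
    return backup
```

For a database that is being written to concurrently, sqlite3.Connection.backup() would be a safer way to take the copy than a plain file copy.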

clear_current_db()[source]

Clear the current run database.

Re-initialises the FTS index and triggers before deleting rows so that DELETE triggers can fire cleanly even if a previous migration or crash left the FTS table missing.

close_all_connections()[source]

Close any open database connections (placeholder for connection pooling).

compute_entry_id(entry)[source]

Generate a stable SHA-1 based ID for a feed entry.

Return type:

str

Parameters:

entry (Dict[str, Any])
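
A stable SHA-1 ID can be derived roughly as follows. This is a sketch only: which entry fields the library actually hashes is not stated here, so the choice of "link" with a fallback to "title" and the function name stable_entry_id are assumptions.

```python
import hashlib
from typing import Any, Dict


def stable_entry_id(entry: Dict[str, Any]) -> str:
    """Return a 40-character hex SHA-1 of a key taken from the entry."""
    # Assumption: prefer the link, fall back to the title.
    key = (entry.get("link") or entry.get("title") or "").strip()
    return hashlib.sha1(key.encode("utf-8")).hexdigest()
```

The same input always yields the same ID, which is what makes the value usable as a deduplication key across runs.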

get_connection(db_key='current', row_factory=True)[source]

Context manager for database connections with automatic commit/rollback.

Parameters:
  • db_key (str) – Which database to connect to ('current', 'history', 'all_feeds')

  • row_factory (bool) – If True, use sqlite3.Row factory for dict-like row access

Yields:

sqlite3.Connection – Database connection

Example

with db.get_connection() as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM entries")
# Auto-commits on success, auto-closes always
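
The commit/rollback/close behaviour described above is the usual context-manager pattern around sqlite3. The following is a minimal standalone sketch of that pattern, not the library's code; open_connection and its path argument are hypothetical stand-ins for get_connection and db_key.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def open_connection(path: str, row_factory: bool = True) -> Iterator[sqlite3.Connection]:
    """Yield a connection; commit on success, roll back on error, always close."""
    conn = sqlite3.connect(path)
    if row_factory:
        conn.row_factory = sqlite3.Row  # rows support row["column"] access
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()
```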

get_current_entries(topic=None, status=None)[source]

Get entries from papers.db with optional filtering.

Return type:

List[Dict[str, Any]]

Parameters:
  • topic

  • status

get_entries_by_criteria(topic=None, min_rank=None, status=None, has_doi=None, order_by='rank_score DESC')[source]

Flexible query builder for entries with various criteria.

Parameters:
  • topic (Optional[str]) – Optional topic filter

  • min_rank (Optional[float]) – Optional minimum rank score

  • status (Optional[str]) – Optional status filter

  • has_doi (Optional[bool]) – If True, only entries with DOI; if False, only without DOI

  • order_by (str) – ORDER BY clause (default: 'rank_score DESC')

Return type:

List[Row]

Returns:

List of sqlite3.Row objects with dict-like access
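
A query builder of this kind typically assembles a parameterised WHERE clause from whichever filters are set. The sketch below shows one way to do that; the table name entries, the column names, and the ORDER BY whitelist are assumptions, not the library's schema.

```python
from typing import Any, List, Optional, Tuple

# Assumed whitelist: ORDER BY cannot be bound as a parameter, so it is validated.
ALLOWED_ORDER = {"rank_score DESC", "rank_score ASC"}


def build_query(topic: Optional[str] = None,
                min_rank: Optional[float] = None,
                status: Optional[str] = None,
                has_doi: Optional[bool] = None,
                order_by: str = "rank_score DESC") -> Tuple[str, List[Any]]:
    """Return (sql, params) for the filters that are not None."""
    clauses: List[str] = []
    params: List[Any] = []
    if topic is not None:
        clauses.append("topic = ?")
        params.append(topic)
    if min_rank is not None:
        clauses.append("rank_score >= ?")
        params.append(min_rank)
    if status is not None:
        clauses.append("status = ?")
        params.append(status)
    if has_doi is True:
        clauses.append("(doi IS NOT NULL AND doi != '')")
    elif has_doi is False:
        clauses.append("(doi IS NULL OR doi = '')")
    if order_by not in ALLOWED_ORDER:
        raise ValueError(f"unsupported order_by: {order_by!r}")

    sql = "SELECT * FROM entries"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += f" ORDER BY {order_by}"
    return sql, params
```

Binding values with "?" placeholders keeps user-supplied filters out of the SQL text; only the validated order_by string is interpolated.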

is_new_entry(title)[source]

Check if an entry is new (title not in all_feed_entries.db).

Return type:

bool

Parameters:

title (str)
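
The title check amounts to an existence query. A minimal sketch, assuming a table named feed_entries with a title column (both names are assumptions, as is the helper name title_is_new):

```python
import sqlite3


def title_is_new(conn: sqlite3.Connection, title: str) -> bool:
    """Return True when no stored entry has exactly this title."""
    row = conn.execute(
        "SELECT 1 FROM feed_entries WHERE title = ? LIMIT 1", (title,)
    ).fetchone()
    return row is None
```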

iter_history_entries(entry_ids)[source]

Iterator for history entries by ID.

Parameters:

entry_ids (List[str]) – List of entry IDs to fetch

Yields:

sqlite3.Row – Database rows with dict-like access

Return type:

Iterator[Row]
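
Fetching rows for a list of IDs is commonly done in chunks so that the number of bound parameters stays below SQLite's limit. The generator below sketches that approach; the table name matched_entries, the entry_id column and the chunk size are assumptions, and the library may fetch rows differently.

```python
import sqlite3
from typing import Iterator, List


def iter_rows_by_id(conn: sqlite3.Connection, entry_ids: List[str],
                    chunk_size: int = 500) -> Iterator[sqlite3.Row]:
    """Yield rows whose entry_id is in entry_ids, one chunk of IDs per query."""
    for start in range(0, len(entry_ids), chunk_size):
        chunk = entry_ids[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        sql = f"SELECT * FROM matched_entries WHERE entry_id IN ({placeholders})"
        # Rows are produced lazily as the caller iterates.
        yield from conn.execute(sql, chunk)
```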

iter_targets(topic=None, min_rank=None)[source]

Iterator for entries that need abstract fetching.

Parameters:
  • topic (Optional[str]) – Optional topic filter (if None, fetches all topics)

  • min_rank (Optional[float]) – Optional minimum rank score filter

Yields:

sqlite3.Row – Database rows with dict-like access

Return type:

Iterator[Row]

purge_old_entries(days)[source]

Remove entries from the most recent N days (including today) based on publication date (YYYY-MM-DD).

Parameters:

days (int)
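
Because the dates are stored as YYYY-MM-DD strings, a date-window delete can compare them lexicographically. The sketch below is one reading of the description, not the library's code: it assumes that "N days including today" means today plus the N-1 days before it, and the table name entries, the column published_date and the helper name purge_recent are hypothetical.

```python
import sqlite3
from datetime import date, timedelta
from typing import Optional


def purge_recent(conn: sqlite3.Connection, days: int,
                 today: Optional[date] = None) -> int:
    """Delete rows published within the last `days` days; return rows deleted."""
    if days < 1:
        return 0
    today = today or date.today()
    # Assumed window: today and the (days - 1) days before it.
    cutoff = (today - timedelta(days=days - 1)).isoformat()
    cur = conn.execute("DELETE FROM entries WHERE published_date >= ?", (cutoff,))
    return cur.rowcount
```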

query_entries(db_key='current', topic=None, min_rank=None, status=None, has_doi=None, has_abstract=None, since=None, until=None, search=None, fuzzy=None, order_by='rank_score DESC', limit=20, offset=0)[source]

General-purpose query across any of the three databases.

Parameters:
  • db_key (str) – 'current', 'history', or 'all_feeds'

  • topic (Optional[str]) – Topic filter (exact match for current, LIKE for history)

  • min_rank (Optional[float]) – Minimum rank_score threshold

  • status (Optional[str]) – Status filter (current DB only)

  • has_doi (Optional[bool]) – If True only entries with DOI, if False only without

  • has_abstract (Optional[bool]) – If True only entries with abstract

  • since (Optional[str]) – Published on or after this date (YYYY-MM-DD)

  • until (Optional[str]) – Published on or before this date (YYYY-MM-DD)

  • search (Optional[str]) – FTS5 keyword search on title + abstract/summary (supports phrases "...", prefix term*, boolean AND/OR/NOT)

  • fuzzy (Optional[str]) – Fuzzy text search via FTS5 trigram (min 3 chars, mutually exclusive with search)

  • order_by (str) – SQL ORDER BY clause

  • limit (int) – Max rows (0 = unlimited)

  • offset (int) – Skip first N rows

Return type:

tuple

Returns:

(rows, total_count) where rows is a list of dicts and total_count is the count before LIMIT/OFFSET.
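
Returning a page of rows together with the pre-pagination count usually takes two queries over the same WHERE clause. The sketch below illustrates that shape under assumptions: a single table named entries, a caller-supplied trusted WHERE fragment, and a helper name paged_query that is not part of the library.

```python
import sqlite3
from typing import Any, Dict, List, Sequence, Tuple


def paged_query(conn: sqlite3.Connection, where: str = "",
                params: Sequence[Any] = (), order_by: str = "rank_score DESC",
                limit: int = 20, offset: int = 0) -> Tuple[List[Dict[str, Any]], int]:
    """Return (rows, total_count); total_count ignores LIMIT/OFFSET."""
    # `where` and `order_by` are interpolated, so they must come from trusted code.
    base = "FROM entries" + (f" WHERE {where}" if where else "")
    total = conn.execute(f"SELECT COUNT(*) {base}", list(params)).fetchone()[0]

    sql = f"SELECT * {base} ORDER BY {order_by}"
    args = list(params)
    if limit > 0:
        sql += " LIMIT ? OFFSET ?"
        args += [limit, offset]
    elif offset > 0:
        # In SQLite a negative LIMIT means "no upper bound".
        sql += " LIMIT -1 OFFSET ?"
        args.append(offset)

    # dict(row) requires conn.row_factory = sqlite3.Row.
    rows = [dict(r) for r in conn.execute(sql, args)]
    return rows, total
```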

save_current_entry(entry, feed_name, topic, entry_id)[source]

Save an entry to papers.db for current run processing.

Parameters:
  • entry

  • feed_name

  • topic

  • entry_id

save_feed_entry(entry, feed_name, entry_id)[source]

Save an entry to all_feed_entries.db with proper date formatting.

Parameters:
  • entry

  • feed_name

  • entry_id

save_matched_entry(entry, feed_name, topic, entry_id)[source]

Save a matched entry to matched_entries_history.db, merging topics if entry already exists.
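
Topic merging can be pictured as an insert-or-update on the entry's topic list. The sketch below assumes topics are stored as a comma-separated string in a topics column of a matched_entries table; that storage format, the column names and the helper name save_with_topic_merge are all assumptions.

```python
import sqlite3


def save_with_topic_merge(conn: sqlite3.Connection, entry_id: str,
                          title: str, topic: str) -> None:
    """Insert the entry, or add `topic` to its existing topic list."""
    row = conn.execute(
        "SELECT topics FROM matched_entries WHERE entry_id = ?", (entry_id,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO matched_entries (entry_id, title, topics) VALUES (?, ?, ?)",
            (entry_id, title, topic),
        )
        return
    topics = [t.strip() for t in (row[0] or "").split(",") if t.strip()]
    if topic not in topics:
        topics.append(topic)
        conn.execute(
            "UPDATE matched_entries SET topics = ? WHERE entry_id = ?",
            (", ".join(topics), entry_id),
        )
```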

Parameters:
  • entry

  • feed_name

  • topic

  • entry_id

update_abstracts_batch(updates)[source]

Batch update abstracts for multiple entries.

Parameters:

updates (List[tuple]) – List of (abstract, doi, entry_id, topic) tuples

Return type:

int

Returns:

Number of rows updated
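
The tuple order (abstract, doi, entry_id, topic) maps naturally onto an executemany call. A minimal sketch, assuming an entries table keyed by (id, topic); the schema, the helper name batch_update_abstracts and the choice to keep an existing DOI when the new one is NULL are assumptions.

```python
import sqlite3
from typing import List, Tuple


def batch_update_abstracts(conn: sqlite3.Connection,
                           updates: List[Tuple[str, str, str, str]]) -> int:
    """Apply (abstract, doi, entry_id, topic) updates; return rows changed."""
    cur = conn.executemany(
        # COALESCE keeps the stored DOI when the update carries None (assumed policy).
        "UPDATE entries SET abstract = ?, doi = COALESCE(?, doi) "
        "WHERE id = ? AND topic = ?",
        updates,
    )
    # For executemany, sqlite3 sums the modifications into rowcount.
    return cur.rowcount
```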

update_entry_rank(entry_id, topic, score, reasoning=None)[source]

Update rank_score (and optionally rank_reasoning) for a single entry.

Parameters:
  • entry_id (str) – Entry identifier (sha1 or normalized link id)

  • topic (str) – Topic name (composite key component)

  • score (float | None) – Rank score to persist (cosine similarity or None)

  • reasoning (str | None) – Optional concise reasoning string

Return type:

None

update_history_abstracts_batch(updates)[source]

Batch update abstracts in history database.

Parameters:

updates (List[tuple]) – List of (abstract, doi, entry_id) tuples

Return type:

int

Returns:

Number of rows updated

update_history_rank(entry_id, score)[source]

Update the historical rank_score, keeping the highest score seen.

Return type:

None
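
"Keeping the highest score seen" can be expressed as a conditional UPDATE that only fires when the new score is larger. This is a sketch under assumed names (matched_entries table, entry_id and rank_score columns, helper keep_highest_rank), not the library's statement.

```python
import sqlite3


def keep_highest_rank(conn: sqlite3.Connection, entry_id: str, score: float) -> None:
    """Store `score` only if no score is stored yet or the stored one is lower."""
    conn.execute(
        "UPDATE matched_entries SET rank_score = ? "
        "WHERE entry_id = ? AND (rank_score IS NULL OR rank_score < ?)",
        (score, entry_id, score),
    )
```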

Parameters:
  • entry_id

  • score