paper_firehose.core.database

Database management for the three-database approach:

- all_feed_entries.db: All RSS entries for deduplication
- matched_entries_history.db: Historical matches across all topics
- papers.db: Current run processing data

Classes

DatabaseManager(config)

Manages the three-database system for feed processing.

class paper_firehose.core.database.DatabaseManager(config)

Bases: object

Manages the three-database system for feed processing.

Parameters: config (Dict[str, Any])

backup_important_databases()

Back up the history and all_feeds databases with timestamped rotation.

Writes timestamped backups alongside the source DBs in the runtime data directory.
Keeps up to the 3 most recent backups per database and prunes older ones.

Returns a dict mapping logical db keys to the created backup file paths.

Return type: Dict[str, str]
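
The rotation described above could be arranged roughly as follows. This is a minimal sketch for a single file, not the library's implementation: the helper name `backup_with_rotation`, the `now` parameter, and the `<stem>.<timestamp>.backup<suffix>` naming scheme are all assumptions made for illustration.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def backup_with_rotation(db_path: Path, keep: int = 3,
                         now: Optional[datetime] = None) -> Path:
    """Copy db_path to a timestamped sibling file, then prune old backups."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    # Hypothetical naming scheme: <stem>.<timestamp>.backup<suffix>
    backup = db_path.with_name(f"{db_path.stem}.{stamp}.backup{db_path.suffix}")
    shutil.copy2(db_path, backup)

    # Fixed-width timestamps sort lexicographically, so the oldest come first.
    pattern = f"{db_path.stem}.*.backup{db_path.suffix}"
    backups = sorted(db_path.parent.glob(pattern))
    for stale in backups[: max(len(backups) - keep, 0)]:
        stale.unlink()
    return backup


# Demo in a throwaway directory: five backups are made, three survive.
demo_dir = Path(tempfile.mkdtemp())
source = demo_dir / "history.db"
source.write_bytes(b"placeholder")
created = [
    backup_with_rotation(source, now=datetime(2024, 1, 1, 12, 0, second))
    for second in range(5)
]
remaining = [p.name for p in sorted(demo_dir.glob("history.*.backup.db"))]
```

The real method handles both databases and returns a dict keyed by logical database name; the sketch returns a single path to keep the rotation logic visible.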

clear_current_db()

Clear the current run database.

Re-initialises the FTS index and triggers before deleting rows so that DELETE triggers can fire cleanly even if a previous migration or crash left the FTS table missing.

close_all_connections()

Close any open database connections (placeholder for connection pooling).

compute_entry_id(entry)

Generate a stable SHA-1 based ID for a feed entry.

Return type: str
Parameters: entry (Dict[str, Any])
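
This page does not say which entry fields feed the hash. As an illustration only, the sketch below assumes the ID is derived from the entry's link, falling back to its title; the function name `sketch_entry_id` and that field choice are hypothetical.

```python
import hashlib
from typing import Any, Dict


def sketch_entry_id(entry: Dict[str, Any]) -> str:
    """Hash a normalised key so the same entry always maps to the same ID."""
    # Assumption: prefer the link and fall back to the title.
    key = (entry.get("link") or entry.get("title") or "").strip()
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


first = sketch_entry_id({"link": "https://example.org/paper/1", "title": "A"})
again = sketch_entry_id({"link": " https://example.org/paper/1 ", "title": "B"})
other = sketch_entry_id({"link": "https://example.org/paper/2", "title": "A"})
```

Because the digest depends only on the normalised key, re-fetching the same feed item yields the same 40-character hex ID, which is what makes it usable as a deduplication key.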

get_connection(db_key='current', row_factory=True)

Context manager for database connections with automatic commit/rollback.

Parameters:

db_key (str) – Which database to connect to ('current', 'history', 'all_feeds')
row_factory (bool) – If True, use sqlite3.Row factory for dict-like row access

Yields:

sqlite3.Connection – Database connection

Example

    with db.get_connection() as conn:
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM entries")
        # Auto-commits on success, auto-closes always
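
For readers wondering how commit-on-success and rollback-on-error can be combined with a guaranteed close, here is one way to build such a context manager with the standard library. It is a sketch under assumptions, not the method's actual body: `open_connection` is a hypothetical stand-in that takes a file path rather than a `db_key`.

```python
import os
import sqlite3
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def open_connection(path: str, row_factory: bool = True) -> Iterator[sqlite3.Connection]:
    conn = sqlite3.connect(path)
    if row_factory:
        conn.row_factory = sqlite3.Row  # enables row["column"] access
    try:
        yield conn
        conn.commit()      # reached only if the with-body did not raise
    except Exception:
        conn.rollback()    # discard the partial transaction
        raise
    finally:
        conn.close()       # runs on both paths


# Demo: the first block commits, the second fails and is rolled back.
db_file = os.path.join(tempfile.mkdtemp(), "demo.db")
with open_connection(db_file) as conn:
    conn.execute("CREATE TABLE entries (id TEXT PRIMARY KEY, title TEXT)")
    conn.execute("INSERT INTO entries VALUES ('a', 'kept')")

try:
    with open_connection(db_file) as conn:
        conn.execute("INSERT INTO entries VALUES ('b', 'discarded')")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

with open_connection(db_file) as conn:
    titles = [row["title"] for row in conn.execute("SELECT title FROM entries")]
```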

get_current_entries(topic=None, status=None)

Get entries from papers.db with optional filtering.

Return type:

List[Dict[str, Any]]

Parameters:

topic (str)
status (str)

get_entries_by_criteria(topic=None, min_rank=None, status=None, has_doi=None, order_by='rank_score DESC')

Flexible query builder for entries with various criteria.

Parameters:

topic (Optional[str]) – Optional topic filter
min_rank (Optional[float]) – Optional minimum rank score
status (Optional[str]) – Optional status filter
has_doi (Optional[bool]) – If True, only entries with DOI; if False, only without DOI
order_by (str) – ORDER BY clause (default: 'rank_score DESC')

Return type:

List[Row]

Returns:

List of sqlite3.Row objects with dict-like access
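
A query builder of this kind typically collects one WHERE clause per supplied filter and binds the values as parameters. The sketch below illustrates that pattern; the function name `build_entries_query`, the table name `entries`, and the column names are assumptions, not taken from the library's schema.

```python
from typing import Any, List, Optional, Tuple


def build_entries_query(
    topic: Optional[str] = None,
    min_rank: Optional[float] = None,
    status: Optional[str] = None,
    has_doi: Optional[bool] = None,
    order_by: str = "rank_score DESC",
) -> Tuple[str, List[Any]]:
    clauses: List[str] = []
    params: List[Any] = []
    if topic is not None:
        clauses.append("topic = ?")
        params.append(topic)
    if min_rank is not None:
        clauses.append("rank_score >= ?")
        params.append(min_rank)
    if status is not None:
        clauses.append("status = ?")
        params.append(status)
    if has_doi is True:
        clauses.append("(doi IS NOT NULL AND doi != '')")
    elif has_doi is False:
        clauses.append("(doi IS NULL OR doi = '')")

    sql = "SELECT * FROM entries"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    # ORDER BY cannot be bound as a parameter, so it is interpolated here.
    sql += f" ORDER BY {order_by}"
    return sql, params


sql, params = build_entries_query(topic="physics", min_rank=0.5, has_doi=True)
unfiltered = build_entries_query()
```

Since the ORDER BY text is spliced into the SQL string, a builder like this should only receive `order_by` values from trusted code, never from user input.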

is_new_entry(title)

Check if an entry is new (title not in all_feed_entries.db).

Return type: bool
Parameters: title (str)

iter_history_entries(entry_ids)

Iterator for history entries by ID.

Parameters: entry_ids (List[str]) – List of entry IDs to fetch
Yields: sqlite3.Row – Database rows with dict-like access
Return type: Iterator[Row]
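
Fetching rows by a list of IDs is often done in chunks, because SQLite caps the number of bound parameters per statement. The following generator sketches that approach; `iter_rows_by_id`, the table name `matched_entries`, and the chunk size are illustrative assumptions rather than details of the real method.

```python
import sqlite3
from typing import Iterator, List


def iter_rows_by_id(conn: sqlite3.Connection, entry_ids: List[str],
                    chunk_size: int = 500) -> Iterator[sqlite3.Row]:
    """Yield rows lazily, querying the IDs one chunk at a time."""
    for start in range(0, len(entry_ids), chunk_size):
        chunk = entry_ids[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        yield from conn.execute(
            f"SELECT * FROM matched_entries WHERE entry_id IN ({placeholders})",
            chunk,
        )


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE matched_entries (entry_id TEXT PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO matched_entries VALUES (?, ?)",
    [(f"id{i}", f"Paper {i}") for i in range(5)],
)
wanted = ["id0", "id3", "id4", "missing"]
found = sorted(row["entry_id"] for row in iter_rows_by_id(conn, wanted, chunk_size=2))
```

IDs with no matching row are simply skipped, so callers cannot assume one yielded row per requested ID.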

iter_targets(topic=None, min_rank=None)

Iterator for entries that need abstract fetching.

Parameters:

topic (Optional[str]) – Optional topic filter (if None, fetches all topics)
min_rank (Optional[float]) – Optional minimum rank score filter

Yields:

sqlite3.Row – Database rows with dict-like access

Return type:

Iterator[Row]

purge_old_entries(days)

Remove entries from the most recent N days (including today) based on publication date (YYYY-MM-DD).

Parameters: days (int)
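
One plausible reading of "most recent N days (including today)" is that `days=1` covers today only and `days=2` covers today and yesterday. The sketch below implements that reading. It is an assumption about the cutoff arithmetic, and the helper name `purge_recent`, the table name `entries`, and the column `published_date` are hypothetical.

```python
import sqlite3
from datetime import date, timedelta
from typing import Optional


def purge_recent(conn: sqlite3.Connection, days: int,
                 today: Optional[date] = None) -> int:
    """Delete rows published within the last `days` days, counting today."""
    today = today or date.today()
    cutoff = (today - timedelta(days=days - 1)).isoformat()
    # YYYY-MM-DD strings compare in chronological order, so >= works on text.
    cursor = conn.execute(
        "DELETE FROM entries WHERE published_date >= ?", (cutoff,)
    )
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id TEXT PRIMARY KEY, published_date TEXT)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?)",
    [("a", "2024-05-10"), ("b", "2024-05-09"), ("c", "2024-05-01")],
)
removed = purge_recent(conn, days=2, today=date(2024, 5, 10))
left = [row[0] for row in conn.execute("SELECT id FROM entries")]
```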

query_entries(db_key='current', topic=None, min_rank=None, status=None, has_doi=None, has_abstract=None, since=None, until=None, search=None, fuzzy=None, order_by='rank_score DESC', limit=20, offset=0)

General-purpose query across any of the three databases.

Parameters:

db_key (str) – 'current', 'history', or 'all_feeds'
topic (Optional[str]) – Topic filter (exact match for current, LIKE for history)
min_rank (Optional[float]) – Minimum rank_score threshold
status (Optional[str]) – Status filter (current DB only)
has_doi (Optional[bool]) – If True only entries with DOI, if False only without
has_abstract (Optional[bool]) – If True only entries with abstract
since (Optional[str]) – Published on or after this date (YYYY-MM-DD)
until (Optional[str]) – Published on or before this date (YYYY-MM-DD)
search (Optional[str]) – FTS5 keyword search on title + abstract/summary (supports phrases "...", prefix term*, boolean AND/OR/NOT)
fuzzy (Optional[str]) – Fuzzy text search via FTS5 trigram (min 3 chars, mutually exclusive with search)
order_by (str) – SQL ORDER BY clause
limit (int) – Max rows (0 = unlimited)
offset (int) – Skip first N rows

Return type:

tuple

Returns:

(rows, total_count) where rows is a list of dicts and total_count is the count before LIMIT/OFFSET.
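
The `(rows, total_count)` return shape usually means running the same filter twice: once as a COUNT and once with LIMIT/OFFSET. A minimal sketch of that pattern follows, leaving out the FTS5 search options. The helper `paged_query`, the table name `entries`, and the column names are assumptions for illustration.

```python
import sqlite3
from typing import Any, Dict, List, Sequence, Tuple


def paged_query(conn: sqlite3.Connection, where: str = "",
                params: Sequence[Any] = (), order_by: str = "rank_score DESC",
                limit: int = 20, offset: int = 0) -> Tuple[List[Dict[str, Any]], int]:
    base = "FROM entries" + (f" WHERE {where}" if where else "")
    # Count first, so the total reflects the filter but not the page window.
    total = conn.execute(f"SELECT COUNT(*) {base}", tuple(params)).fetchone()[0]

    # In SQLite a negative LIMIT means "no upper bound", used here for limit=0.
    effective_limit = limit if limit > 0 else -1
    rows = conn.execute(
        f"SELECT * {base} ORDER BY {order_by} LIMIT ? OFFSET ?",
        (*params, effective_limit, offset),
    )
    return [dict(row) for row in rows], total


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE entries (id TEXT PRIMARY KEY, rank_score REAL)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?)",
    [(f"id{i}", i / 10) for i in range(1, 6)],
)
page, total = paged_query(conn, limit=2, offset=1)
everything, _ = paged_query(conn, limit=0)
filtered, filtered_total = paged_query(conn, "rank_score >= ?", (0.3,), limit=1)
```

Returning the pre-pagination count alongside the page lets a caller render "showing 2 of 5" without issuing a second query of its own.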

save_current_entry(entry, feed_name, topic, entry_id)

Save an entry to papers.db for current run processing.

Parameters:

entry (Dict[str, Any])
feed_name (str)
topic (str)
entry_id (str)

save_feed_entry(entry, feed_name, entry_id)

Save an entry to all_feed_entries.db with proper date formatting.

Parameters:

entry (Dict[str, Any])
feed_name (str)
entry_id (str)

save_matched_entry(entry, feed_name, topic, entry_id)

Save a matched entry to matched_entries_history.db, merging topics if entry already exists.

Parameters:

entry (Dict[str, Any])
feed_name (str)
topic (str)
entry_id (str)
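
How topics are stored in the history database is not stated on this page. Assuming, purely for illustration, a comma-separated text column, merging a new topic into an existing row could look like this (`merge_topics` is a hypothetical helper):

```python
def merge_topics(existing: str, new_topic: str) -> str:
    """Append new_topic to a comma-separated list unless already present."""
    # Assumption: topics are stored as "topic_a, topic_b" in one text column.
    topics = [t.strip() for t in existing.split(",") if t.strip()]
    if new_topic not in topics:
        topics.append(new_topic)
    return ", ".join(topics)


added = merge_topics("physics", "ml")
unchanged = merge_topics("physics, ml", "ml")
from_empty = merge_topics("", "ml")
```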

update_abstracts_batch(updates)

Batch update abstracts for multiple entries.

Parameters: updates (List[tuple]) – List of (abstract, doi, entry_id, topic) tuples
Return type: int
Returns: Number of rows updated
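
The tuple order `(abstract, doi, entry_id, topic)` suggests an UPDATE that sets the first two values and matches on the last two. The sketch below shows that shape with `executemany`. The SQL, the table name `entries`, the column names, and the helper `update_abstracts` are assumptions, and the DOI is a placeholder value.

```python
import sqlite3
from typing import List, Optional, Tuple

Update = Tuple[str, Optional[str], str, str]  # (abstract, doi, entry_id, topic)


def update_abstracts(conn: sqlite3.Connection, updates: List[Update]) -> int:
    cursor = conn.executemany(
        "UPDATE entries SET abstract = ?, doi = ? WHERE id = ? AND topic = ?",
        updates,
    )
    # For executemany, rowcount is the sum over all executed statements.
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id TEXT, topic TEXT, abstract TEXT, doi TEXT, "
    "PRIMARY KEY (id, topic))"
)
conn.executemany(
    "INSERT INTO entries (id, topic) VALUES (?, ?)",
    [("a", "physics"), ("a", "ml")],
)
updated = update_abstracts(conn, [
    ("An abstract.", "10.1000/placeholder", "a", "physics"),  # matches one row
    ("Ignored.", None, "zzz", "ml"),                          # matches nothing
])
stored = conn.execute(
    "SELECT abstract FROM entries WHERE id = 'a' AND topic = 'physics'"
).fetchone()[0]
```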

update_entry_rank(entry_id, topic, score, reasoning=None)

Update rank_score (and optionally rank_reasoning) for a single entry.

Parameters:

entry_id (str) – Entry identifier (sha1 or normalized link id)
topic (str) – Topic name (composite key component)
score (float | None) – Rank score to persist (cosine similarity or None)
reasoning (str | None) – Optional concise reasoning string

Return type:

None

update_history_abstracts_batch(updates)

Batch update abstracts in history database.

Parameters: updates (List[tuple]) – List of (abstract, doi, entry_id) tuples
Return type: int
Returns: Number of rows updated

update_history_rank(entry_id, score)

Update the historical rank_score, keeping the highest score seen.

Return type:

None

Parameters:

entry_id (str)
score (float | None)
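
"Keeping the highest score seen" can be expressed as a conditional UPDATE that only writes when the new score beats the stored one. The sketch below shows one way to do that. The helper name `keep_highest_rank`, the table name `matched_entries`, and the choice to ignore a `None` score are assumptions.

```python
import sqlite3
from typing import Optional


def keep_highest_rank(conn: sqlite3.Connection, entry_id: str,
                      score: Optional[float]) -> None:
    if score is None:
        return  # Assumption: a missing score never overwrites a stored one.
    conn.execute(
        "UPDATE matched_entries SET rank_score = ? "
        "WHERE entry_id = ? AND (rank_score IS NULL OR rank_score < ?)",
        (score, entry_id, score),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE matched_entries (entry_id TEXT PRIMARY KEY, rank_score REAL)")
conn.execute("INSERT INTO matched_entries VALUES ('a', NULL)")
for candidate in (0.4, 0.9, None, 0.2):
    keep_highest_rank(conn, "a", candidate)
final_score = conn.execute(
    "SELECT rank_score FROM matched_entries WHERE entry_id = 'a'"
).fetchone()[0]
```

Doing the comparison inside the UPDATE avoids a separate read-then-write round trip.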