Storing Ingestion Status in Firestore

There are two ways a file can be ingested from GCS: via Pub/Sub (an event fires when a new GCS file is created) or via a catch-up loop that scans the last N days' worth of files and creates a synthetic event for any that were missed.

We need to keep track of which files have been ingested and which have not, to avoid redundant work. In general, ingesters should be tolerant of seeing the same file multiple times, but by tracking which files have already been ingested, we can keep these duplicate events to a minimum (e.g. when the Pub/Sub event fires at exactly the moment the scanner processes the same file).

A core assumption of Gold is that files won't change after uploading. We include the MD5 hash in all calculations on the off chance there are changes (and we can't just use the MD5 hash for fear of collisions), but that's not the primary consideration for this store.

Schema

In the spreadsheet metaphor, Firestore Collections are tables and Documents are the rows, with the fields of the Documents being the columns.

The schema here is pretty straightforward, with the following ingestedEntry Documents:

ID           string  # autogenerated
IngestedFile string  # FileName + "|" + MD5Hash
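
As a rough illustration of the schema above, the entry and its key could be expressed in Go as follows. This is only a sketch: the helper names ingestedFileKey and newIngestedEntry are hypothetical, and the ID is left empty on the assumption that Firestore assigns it when the Document is created.

```go
package main

// ingestedEntry mirrors the Document described above.
type ingestedEntry struct {
	// ID is autogenerated; this sketch assumes Firestore fills it in.
	ID string
	// IngestedFile is FileName + "|" + MD5Hash.
	IngestedFile string
}

// ingestedFileKey (hypothetical helper) builds the IngestedFile value.
// Combining both parts means a changed file (new MD5) yields a new key,
// and two different files with colliding MD5 hashes stay distinct.
func ingestedFileKey(fileName, md5Hash string) string {
	return fileName + "|" + md5Hash
}

// newIngestedEntry (hypothetical helper) returns an entry ready to be stored.
func newIngestedEntry(fileName, md5Hash string) ingestedEntry {
	return ingestedEntry{IngestedFile: ingestedFileKey(fileName, md5Hash)}
}
```

Because the key is a plain string, checking whether a file was ingested reduces to an equality match on a single field.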

Indexing

Simple (single-field) indices should be fine, since the only query is an equality match on IngestedFile.

Usage

We simply query for a given filename + MD5 hash combination (i.e. the IngestedFile value) and see whether a matching Document exists. There is no need to cache anything unless this becomes a performance bottleneck.
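
The check-then-ingest flow described here might look roughly like the sketch below. Everything in it is an assumption made for illustration: the ingestionStore interface, the memStore type (an in-memory stand-in for the Firestore collection so the example has no external dependencies), and the ingestOnce helper are hypothetical names, not Gold's actual API.

```go
package main

import "sync"

// ingestionStore is a hypothetical interface over the ingestion records.
type ingestionStore interface {
	WasIngested(fileName, md5Hash string) (bool, error)
	SetIngested(fileName, md5Hash string) error
}

// memStore is an in-memory stand-in for the Firestore collection.
type memStore struct {
	mu   sync.Mutex
	seen map[string]bool // keyed by FileName + "|" + MD5Hash
}

func newMemStore() *memStore {
	return &memStore{seen: map[string]bool{}}
}

func entryKey(fileName, md5Hash string) string {
	return fileName + "|" + md5Hash
}

func (m *memStore) WasIngested(fileName, md5Hash string) (bool, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.seen[entryKey(fileName, md5Hash)], nil
}

func (m *memStore) SetIngested(fileName, md5Hash string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.seen[entryKey(fileName, md5Hash)] = true
	return nil
}

// ingestOnce calls process only if the file is not yet recorded as
// ingested, then records it. It reports whether process actually ran.
func ingestOnce(s ingestionStore, fileName, md5Hash string, process func() error) (bool, error) {
	done, err := s.WasIngested(fileName, md5Hash)
	if err != nil {
		return false, err
	}
	if done {
		return false, nil // already ingested; skip the duplicate event
	}
	if err := process(); err != nil {
		return false, err
	}
	if err := s.SetIngested(fileName, md5Hash); err != nil {
		return true, err
	}
	return true, nil
}
```

Note that the check and the write are not atomic in this sketch, so two events arriving at the same moment could both run process. That is acceptable under the assumption stated earlier that ingesters tolerate seeing the same file more than once; the store only reduces how often it happens.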