A Common Data Model

2018-08-28

In a variety of contexts, we need to record a statement about the world, along with its relations to other statements.

We might need to record, for instance, that a given version of a document contained a given word at a given time. That word is linked to other documents and their versions. Documents can have notes attached; these notes also contain words, which can also be searched, but under a different label (let's say the "notes" label, as opposed to the "content" label).

Similarly, companies often want to track build history from check-in to final artifact. Documents have versions and other metadata, which are linked to compilation logs, then to artifacts, then to the deployment environment. Each is linked to the others, but carries different labels.

Similarly, one might want to track a political group over time, with connections to other political groups - along with any comments on the connections and their durations.

Generally, let's say that we have atoms, with relations to other atoms. Both atoms and relations can have 0 or more labels.

Let us thus say, for the sake of abstraction, that we have a graph G = (V, E), where both the vertices V and the edges E carry labels drawn from a set L. We would like to query the graph by vertex, by edge, and by label.

An initial implementation will be done in-memory as a prototype, with a suitable API layered on top.

This model will doubtless require adjustment; it is entirely plausible that vertices will require an arbitrarily deep metadata store that relations will not. The initial in-memory implementation will allow fluid alterations without the tedium of database changes, and because I have chosen to write it in Scala, the types will stay consistent, with good opportunity for refactoring down the road.
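To make the "arbitrarily deep metadata" idea concrete, here is one rough way it could look. This is only a sketch under my own assumptions: the `Meta` tree, the extra `meta` field on `Atom`, and the `lookup` helper are all hypothetical names, not part of the model proposed here.

```scala
import java.util.UUID

object MetadataSketch {
  // Hypothetical recursive metadata tree: a leaf value, a list,
  // or a keyed node holding further metadata.
  sealed trait Meta
  final case class MetaValue(s: String) extends Meta
  final case class MetaList(items: List[Meta]) extends Meta
  final case class MetaNode(fields: Map[String, Meta]) extends Meta

  final case class Label(s: String)

  // Assumption: only atoms carry metadata; relations would stay as they are.
  final case class Atom(
    pk: UUID,
    s: String,
    labels: Set[Label],
    meta: Meta = MetaNode(Map.empty)
  )

  // Follow a path of keys down through nested MetaNodes.
  // Returns None if a key is missing or the path runs into a non-node.
  def lookup(m: Meta, path: List[String]): Option[Meta] = path match {
    case Nil => Some(m)
    case key :: rest =>
      m match {
        case MetaNode(fields) => fields.get(key).flatMap(lookup(_, rest))
        case _                => None
      }
  }

  // Illustrative data only.
  val build: Atom = Atom(
    UUID.randomUUID(),
    "build-log",
    Set(Label("build")),
    MetaNode(Map("compiler" -> MetaNode(Map("version" -> MetaValue("1.0")))))
  )
}
```

With this shape, `MetadataSketch.lookup(build.meta, List("compiler", "version"))` finds the nested value, while a path through a missing key simply yields `None`. Whether a tree like this or a flat key-value map is the better fit is exactly the kind of question the prototype is meant to answer.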

To be clear, this is intentionally not a text search system (e.g. an inverted index), nor is it a log search system à la Elasticsearch. A relational model best describes each instance of the problem, but the problem is uncomfortably general. It is closest to a graph database model; earlier work that I've done on non-abstract instances of this problem indicates that a graph database is a good specialized backend.

I leave you with this data model as a starting point:

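import java.util.UUID

// A tag such as "content" or "notes".
case class Label(s: String)

// A statement about the world.
case class Atom(pk: UUID, s: String, labels: Set[Label])

// A directed, labelled link between two atoms.
case class Relation(pk: UUID, s: String, from: Atom, to: Atom, labels: Set[Label])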
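As a rough illustration of how the in-memory prototype might sit on top of these types, here is a minimal sketch. The `Graph` class and its method names (`addRelation`, `atomsLabelled`, `outgoing`, `neighbours`) are my own assumptions for the sake of the example, as is the sample data; the real API may well look different.

```scala
import java.util.UUID

object GraphSketch {
  final case class Label(s: String)
  final case class Atom(pk: UUID, s: String, labels: Set[Label])
  final case class Relation(pk: UUID, s: String, from: Atom, to: Atom, labels: Set[Label])

  // Hypothetical in-memory store: two immutable sets, nothing cleverer.
  final case class Graph(
    atoms: Set[Atom] = Set.empty,
    relations: Set[Relation] = Set.empty
  ) {
    def addAtom(a: Atom): Graph = copy(atoms = atoms + a)

    // Adding a relation also registers both endpoints,
    // so every relation in the graph points at known atoms.
    def addRelation(r: Relation): Graph =
      copy(atoms = atoms + r.from + r.to, relations = relations + r)

    // Query by label: atoms (or relations) carrying a given label.
    def atomsLabelled(l: Label): Set[Atom] = atoms.filter(_.labels.contains(l))
    def relationsLabelled(l: Label): Set[Relation] = relations.filter(_.labels.contains(l))

    // Query by edge: relations leaving a given atom.
    def outgoing(a: Atom): Set[Relation] = relations.filter(_.from.pk == a.pk)

    // Query by vertex: atoms one step away from a given atom.
    def neighbours(a: Atom): Set[Atom] = outgoing(a).map(_.to)
  }

  // Illustrative data: a document with a note attached to it.
  val content: Label = Label("content")
  val notes: Label = Label("notes")

  val doc: Atom = Atom(UUID.randomUUID(), "design.txt v3", Set(content))
  val note: Atom = Atom(UUID.randomUUID(), "revisit the schema", Set(notes))
  val tagged: Relation = Relation(UUID.randomUUID(), "has-note", doc, note, Set(notes))

  val graph: Graph = Graph().addRelation(tagged)
}
```

Here `GraphSketch.graph.neighbours(doc)` returns the set containing only `note`, and `atomsLabelled(notes)` returns the same set. Every query is a linear scan over a set, which is fine for a prototype; indexing by label or by endpoint is something a real backend would take over.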

enjoy!