Design Patterns for MongoDB
Since the dawn of computing, data has been growing relentlessly, and this has a direct impact on the needs for storage, processing and analytics technologies. Over the past decade, developers have moved from SQL to NoSQL databases, with MongoDB dominating in popularity as an operational data store in the world of enterprise applications.
If you have read any of my recent articles, or know me in person, you may have realised how much I value software architecture and patterns. Most people think these are only applicable to server-side code. I truly believe, though, that the database design should not be an afterthought but a key part of the architecture: bad design choices directly affect the solution's scalability and performance.
So today I will introduce you to a few practical MongoDB design patterns that any full stack developer should aim to understand when using the MERN/MEAN stack of technologies:
❗️Assumption: Basic familiarity with MongoDB is necessary, as is some understanding of relational modelling (because we will refer to SQL as a contrasting approach).
The Grand Scheme (or Schema 😄) of Things
Often, we think of MongoDB as a schema-less database, but this is not quite true! It does have a schema, but that schema is dynamic: it is not enforced across the documents of a collection, and it is free to change and morph from one document to the next, which is why it is called a polymorphic schema. In practice, this means that diverse datasets can be stored together, a competitive advantage in the era of booming unstructured big data.
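As a quick illustration, here is a minimal mongosh sketch (the collection and field names are made up for this example) of two differently shaped documents living happily in the same collection:

```js
// Two documents with different shapes coexisting in the same collection.
db.events.insertMany([
  { type: "pageView", url: "/home", visitedAt: new Date() },
  { type: "purchase", orderId: 42, items: [{ sku: "A1", qty: 2 }], total: 19.98 }
]);

// They can still be queried together, despite having different fields.
db.events.find({ type: "purchase" });
```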
Inheritance & Polymorphism
Especially when it comes to Object Oriented Programming (OOP) and inheritance, the polymorphic capability of MongoDB comes in very handy, as developers can serialise instances of different classes of the same hierarchy (parent-child) to the same collection, and then deserialise them back into objects.
This is not very straightforward in relational databases, as tables have fixed schemas. For example, consider a trading system: a Security base class can be derived as Stock, Equity, Option etc.
While in MongoDB we can store the derived types in a single collection called Security and add a discriminator field (_t) on each document, in a relational database we have these modelling choices:
- A single table with the union of the fields of Stock, Equity and Option, resulting in a sparsely populated schema.
- Three tables, one for each concrete implementation of Stock, Equity and Option, resulting in redundancy (the base Security attributes are repeated in each), as well as complicated queries to retrieve all types of securities.
- One Security table for the common content, plus three tables for Stock, Equity and Option, each holding a SecurityID and only the respective attributes. This option solves the redundancy issue, but the queries can still get complex.
As you can see, there is a lot more code involved than with a polymorphic MongoDB collection!
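To make this concrete, here is a hedged mongosh sketch of the single-collection approach; only the _t discriminator idea comes from the text above, the security attributes shown are assumptions:

```js
// All derived types live in one collection, distinguished by the _t discriminator.
db.security.insertMany([
  { _t: "Stock",  symbol: "AAPL", lastPrice: 172.5 },
  { _t: "Equity", symbol: "VOD",  exchange: "LSE" },
  { _t: "Option", underlying: "AAPL", strike: 180, expiry: ISODate("2025-12-19") }
]);

// Retrieve every security regardless of its concrete type...
db.security.find({});

// ...or only one concrete type, via the discriminator.
db.security.find({ _t: "Option" });
```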
Schema Evolution
The only thing constant in life is change, and this certainly holds true for database schemas, where it often poses challenges, and a few headaches, in traditional relational database systems. The Achilles' heel of a tabular schema that has been carefully normalised to eliminate redundancy is that a small change to one table can cause a ripple of changes across the database, and can spill into the server-side application code too.
A typical approach is to stop the application, take a backup, run complex migration scripts to support the new schema, release the new version of the application and restart it. With continuous deployment (CD) taking care of the application side of the release, the most time-consuming task, the one requiring lengthy downtime, is the database migration itself. Some ALTER TABLE commands executed on large tables can even take days to complete…
In MongoDB, however, backwards compatibility comes out of the box, as developers account for these changes in the server-side code itself. Once the application is updated to handle the absence of a field, we can migrate the collection in question in the background while the application is still running (assuming there is more than a single node involved). When the entire collection has been migrated, we can update our application code to truly forget the old field.
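As a sketch of what such a rolling migration could look like in mongosh, assuming the application already tolerates both the old and the new field name (the users collection and the phoneNumber/phone fields are hypothetical):

```js
// Application code tolerates both shapes while the migration runs,
// e.g. by reading user.phone ?? user.phoneNumber.

// Background migration: rename the old field on documents that still carry it,
// while the application keeps serving traffic.
db.users.updateMany(
  { phoneNumber: { $exists: true } },
  { $rename: { phoneNumber: "phone" } }
);

// Once this returns 0, the fallback code can be deleted from the application.
db.users.countDocuments({ phoneNumber: { $exists: true } });
```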
Database design is not something that is written in stone and schema changes can be vexing (if not paralysing) in legacy tabular databases, so the polymorphic feature of MongoDB is very powerful indeed.
To Embed or Not to Embed: That is The Question!
If you have any OOP experience, you will have come across Eric Evans' classic book Domain-Driven Design, which introduces the concept of aggregates. An aggregate is a collection of data that we interact with as a unit, and it normally has a more complex structure than a traditional row/record, i.e. it can hold nested lists, dictionaries or other composite types.
Atomicity is only supported within the contents of a single aggregate; in other words, the aggregate forms the boundary of an ACID operation (read more in the MongoDB manual). Handling inter-aggregate relationships is more difficult than intra-aggregate ones: joins are not supported directly inside the kernel, but are managed in the application code or with the somewhat complex aggregation pipeline framework.
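For example, a single updateOne against one document (aggregate) is atomic even when it touches several embedded fields; the orders collection and its fields below are hypothetical:

```js
// The status change and the appended history entry are applied atomically,
// because they both live inside the same document (the order aggregate).
db.orders.updateOne(
  { orderNo: 1001 },
  {
    $set:  { status: "shipped" },
    $push: { history: { status: "shipped", at: new Date() } }
  }
);
```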
In essence there is a fine balance on whether to embed related objects within one another or reference them by ID, and as most things in modelling there is not a one-stop solution on how to make this decision. It is very much context specific as it depends on how the application interacts with the data.
Before we proceed, we need to understand what the advantages of embedding are:
🔴 The main reason for embedding documents is read performance, which is connected to the very nature of how computer disks work: when looking for a particular record, it may take a while to locate it (high latency), but once located, reading additional adjacent bytes is fast (high bandwidth). Collocating related information therefore makes sense, as it can be retrieved in one go.
🔴 Another benefit is that embedding reduces the round trips to the database that we would otherwise have to program in order to query separate collections.
Now let’s explore some points to ponder when designing our MongoDB schema, based on the type of relationship that two entities have:
1:1
A One-to-One relationship is a type of cardinality describing a relationship between two entities, where one record of entity A is associated with exactly one record of entity B. It can be modelled in two ways: either embedding the related entity as a sub-document, or linking to a document in a separate collection (no foreign key constraints are enforced, so the relation exists only in the application-level schema). The right choice depends on how the data is accessed by the application, how frequently, and also on the lifecycle of the dataset (e.g. if entity A is deleted, does entity B still have to exist?).
Golden Rule #1: If an object B needs to be accessed on its own (i.e. outside the context of the parent object A) then use reference, otherwise embed.
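A minimal sketch of both options for a hypothetical user/address pair (all names are illustrative):

```js
// Option 1: embed. The address never needs to be accessed on its own.
db.users.insertOne({
  name: "Ada",
  address: { street: "12 Baker St", city: "London" }
});

// Option 2: reference. The address lives in its own collection and is linked
// by id; note that MongoDB enforces no foreign key constraint on addressId.
const addressId = db.addresses.insertOne({
  street: "12 Baker St",
  city: "London"
}).insertedId;
db.users.insertOne({ name: "Ada", addressId: addressId });
```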
1:N
A One-to-Many relationship refers to the relationship between two entities A and B, where one side can have one or more links to the other, while the reverse is single sided. Like a 1:1 relationship, it can also be modelled by leveraging embedding or referencing.
Here are the main considerations to take into account:
If a nested array of objects is expected to grow without control, embedding is not recommended because:
- Each document cannot exceed 16MB.
- New space needs to be allocated for the growing document, and indexes also need to be updated, which impacts write performance.
In this case referencing is preferred, and entities A and B are modelled as stand-alone collections. One trade-off, however, is that we need a second query to get the details of entity B, so read performance might be impacted. An application-level join comes to the rescue: with the correct indexing (for memory optimisation) and the use of projections (to reduce network bandwidth), such application-side joins are only slightly more expensive than joins pushed down to the DB engine. The $lookup operator should be needed infrequently; if we need it a lot, there is a schema-smell!
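As an illustration, here is a sketch of such an application-level join for hypothetical posts and comments collections, combining an index on the referencing field with a projection:

```js
// An index on the referencing field keeps the second query cheap.
db.comments.createIndex({ postId: 1 });

// 1st round trip: fetch the parent document.
const post = db.posts.findOne({ slug: "design-patterns-for-mongodb" });

// 2nd round trip: fetch only the fields we need from the children (projection),
// and join them in application code instead of using $lookup.
const comments = db.comments
  .find({ postId: post._id }, { author: 1, body: 1, createdAt: 1 })
  .toArray();
```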
Another option is to use pre-aggregated collections (acting as OLAP cubes) to simplify some of these joins.
Golden Rule # 2: Arrays should not grow without bound.
- If there are fewer than a couple of hundred narrow documents on the B side, it is safe to embed them;
- If there are more than a couple of hundred documents on the B side, we don't embed the whole documents; we link them via an array of B ObjectID references;
- If there are more than a few thousand documents on the B side, we use a parent-reference to the A-side document in each B object.
and
Golden Rule # 3: Application-level joins are common practice and not to be frowned upon; in these cases indexing selection makes or breaks the querying performance.
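Putting Golden Rules #2 and #3 together, here is a hedged sketch of the three 1:N shapes described above, using hypothetical posts and comments collections:

```js
// (a) A handful of small B documents: embed them directly in A.
db.posts.insertOne({
  title: "Embedding",
  comments: [{ author: "Ann", body: "Nice post!" }]
});

// (b) Up to a few hundred B documents: keep an array of ObjectID references on A.
const commentId = db.comments.insertOne({ author: "Bob", body: "Linked" }).insertedId;
db.posts.insertOne({ title: "Child references", commentIds: [commentId] });

// (c) Thousands of B documents: store a parent reference on each B instead.
const postId = db.posts.insertOne({ title: "Parent references" }).insertedId;
db.comments.insertOne({ postId: postId, author: "Cat", body: "Scales better here" });
```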
❗️Denormalisation: Two factors to weigh up before denormalising our documents are:
- Updates to the duplicated data will no longer be atomic;
- The read-to-write ratio (a field that is mostly read and rarely updated is a good candidate for denormalisation).
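For instance, an author's display name, read on every page view but rarely changed, could be denormalised into each post; the collections below are hypothetical:

```js
// The duplicated authorName saves a join on every read of a post...
const authorId = db.authors.insertOne({ name: "Ada Lovelace" }).insertedId;
db.posts.insertOne({
  title: "Denormalisation",
  authorId: authorId,
  authorName: "Ada Lovelace" // copied from the authors collection
});

// ...but a rename now has to touch every post, and the two updates below
// are not atomic as a pair.
db.authors.updateOne({ _id: authorId }, { $set: { name: "A. Lovelace" } });
db.posts.updateMany({ authorId: authorId }, { $set: { authorName: "A. Lovelace" } });
```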
N:M
A Many-to-Many relationship refers to the relationship between two entities A and B, where both sides can have one or more links to the other. In relational databases, these cases are modelled with a junction table; in MongoDB, however, we can use bi-directional embedding: we query A to find the embedded references to the B objects, and then query B with the $in operator to fetch the documents behind those references (the reverse is also possible).
Here the complexity arises from striking an even balance between A and B, as the 16MB threshold can also be breached. In these instances, one-way embedding is recommended.
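Finally, a sketch of the N:M query pattern with one-way embedding and $in, for hypothetical books and authors collections:

```js
// One-way embedding: each book carries an array of author references.
const adaId = db.authors.insertOne({ name: "Ada" }).insertedId;
const bobId = db.authors.insertOne({ name: "Bob" }).insertedId;
db.books.insertOne({ title: "Design Patterns for MongoDB", authorIds: [adaId, bobId] });

// From a book to its authors: fetch the book, then resolve the references with $in.
const book = db.books.findOne({ title: "Design Patterns for MongoDB" });
const authors = db.authors.find({ _id: { $in: book.authorIds } }).toArray();

// The reverse direction is a single query against the embedded array.
const booksByAda = db.books.find({ authorIds: adaId }).toArray();
```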