So I jumped onto the NoSQL (Not-Only SQL) bandwagon and
of late have been dabbling with MongoDB and
PHP. Coming from an RDBMS background, I find NoSQL is for sure a different
paradigm that requires thinking differently. This
series of posts is about my experience with MongoDB, the problems that I’ve faced
and the solutions/workarounds that I’ve found so far. All the code samples in
these posts were run on the 64-bit Linux version of MongoDB 2.0.3 with PECL Mongo driver 1.2.9. These posts
assume basic familiarity with MongoDB.
Background
The reason I even started looking outside of RDBMS was
the loosely defined schema that I had to support. Basically, on our website
a user can submit activities. These activities can be of different types,
and the activity attributes change based on the activity type. We started off
with a MySQL DB and soon realized that supporting different schemas in an
RDBMS was a bit painful. The most popular approach for modeling this kind of
schema seems to be EAV
(as used by Magento), but I found it a bit too complex. We actually
started serializing our entities into XML and dumping them into a MySQL column,
but I realized that by doing so I was not getting any of the benefits of an RDBMS
(no referential integrity etc).
Why Mongo?
Once I decided that RDBMS was a no-go, I started looking for alternatives in the NoSQL world. I had tried playing with Cassandra and Hadoop/HDFS earlier but felt they had a steep learning curve. Also, I feel Cassandra and Hadoop/HDFS are more suitable for applications dealing with huge amounts of data and complex processing, given their distributed nature and great support for Map-Reduce. I finally evaluated MongoDB, CouchDB and Redis. I found Redis to be a glorified key-value store and MongoDB to be the closest to an RDBMS (you can have indexes, dbrefs) without being one, which makes it a little easier to learn for people with an RDBMS background.
Mongo Schema Design
After picking Mongo and installing it, the first choice
that you have to make is the schema design and how to define relationships
between entities. There are two ways in which you can define relationships:
- Embedding: A document becomes a subdocument of another document. You can embed as many levels deep as you wish. Warning: embedding generally works great if you embed only one level deep, as Mongo currently does not have good support for querying/updating attributes which are nested multiple levels deep. More on this in a subsequent post, where I describe the issue I ran into with the $ operator.
- Linking: Documents are different entities (part
of different collections) and are linked by their MongoIds (_id). Enforcing
relational integrity is primarily the responsibility of the client application.
Following the docs, I use
this rule: if the relationship between two entities is a “Composition”, I
embed the sub-entity as a subdocument; otherwise I add it to a different collection and link
the two using MongoId. Below is one example of composition and linking:
> db.activities.find({}).pretty();
{
    activityTitle: "This is example",
    tags: ["tag1", "tag2", "tag2"],
    submittedBy: 23
    ….
}
Here tags is embedded in the activity (1:n relationship), whereas
submittedBy holds a reference to the id of the user stored in db.users. A tag
can itself be a complex object with its own attributes, e.g.
tags:[{tagId:1,name:"tag1",submittedBy:23},…].
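To make the difference concrete, here is a rough sketch of the same activity as the PHP associative array that would be handed to the driver. It uses plain arrays only and makes no driver calls. The user's name field, the second tag and the in-memory $usersById lookup are assumptions for illustration; the lookup merely stands in for a query against db.users.

```php
<?php
// Hypothetical user record; only the numeric id 23 appears in the post.
$user = array('_id' => 23, 'name' => 'Example User');

$activity = array(
    'activityTitle' => 'This is example',
    // Embedding (composition): the tags live inside the activity document.
    'tags' => array(
        array('tagId' => 1, 'name' => 'tag1', 'submittedBy' => 23),
        array('tagId' => 2, 'name' => 'tag2', 'submittedBy' => 23),
    ),
    // Linking: only the user's id is stored; the user lives in its own collection.
    'submittedBy' => 23,
);

// Resolving a link is the application's job. This in-memory map is a
// stand-in for fetching the user from the users collection by id.
$usersById = array($user['_id'] => $user);
$submitter = isset($usersById[$activity['submittedBy']])
    ? $usersById[$activity['submittedBy']]
    : null;
```

The embedded tags come back with the activity in a single read, while the linked user needs a second lookup, and nothing stops submittedBy from pointing at a user that no longer exists.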
That’s pretty much it for the schema design; currently our
schema is pretty straightforward, with collections for “top-level”
entities like activities, users, keywords (more on this later) and subdocuments
for sub-entities like tags.
One caveat: Mongo
field names (the equivalent of column names) are case sensitive, unlike column
names in MySQL. Quite a few times in the beginning, I had a field name in the
wrong case and wasted a bunch of time trying to figure out what was wrong with my query.
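The effect is easy to reproduce without a database, since PHP array keys are case sensitive too. The sketch below is only a simulation in plain PHP: matchesCriteria() is a hypothetical helper that mimics a simple equality match and is not part of the Mongo driver.

```php
<?php
$document = array('activityTitle' => 'This is example', 'submittedBy' => 23);

// Hypothetical helper: true only when every criteria field exists in the
// document under exactly the same name and holds the same value.
function matchesCriteria(array $document, array $criteria)
{
    foreach ($criteria as $field => $expected) {
        if (!array_key_exists($field, $document) || $document[$field] !== $expected) {
            return false;
        }
    }
    return true;
}

// Same value, but the second criteria array spells the field name in lower case.
$rightCase = matchesCriteria($document, array('activityTitle' => 'This is example'));
$wrongCase = matchesCriteria($document, array('activitytitle' => 'This is example'));
```

The wrongly cased criteria does not raise an error; it simply matches nothing, which is why this kind of mistake is easy to miss.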
ODM (Object Document Mapper) Strategy
Once the Mongo collections are finalized
and the PHP DTO (Data Transfer Object)/model classes are defined, the next step is
to figure out how to map our models to Mongo collections and how to
retrieve/persist them. The PECL driver is pretty flexible but only supports
retrieving/persisting PHP associative arrays, so there has to be an adapter in
between which converts our PHP classes into associative arrays. There are a few
mappers available already, like Doctrine
and Php-ODM,
but somehow I wasn’t very comfortable with them: Doctrine seemed to be a bit
too heavy, whereas Php-ODM relies on storing properties in an internal array.
What we wanted was a way to store protected variables and have getters and
setters so that we can validate the values and typecast them. We also
needed to be able to return only a subset of properties (in case of a partial
update). Our implementation is pretty simple: every model inherits from
BaseModel, which has a method toArray(). The toArray() method just calls
get_object_vars() and takes two optional arrays as parameters: includeItems and
excludeItems. Below is the implementation of our toArray() method:
Our Models have protected variables for properties which need to be serialized and override the toArray() method if needed.
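For illustration, here is a minimal sketch of what such a BaseModel could look like. It is not the exact production code: the Activity class, its setters, and the choice of array_intersect_key()/array_diff_key() for the filtering are assumptions made for the example. Only the toArray() name, the get_object_vars() call and the includeItems/excludeItems parameters come from the description above.

```php
<?php
abstract class BaseModel
{
    // Returns the model's properties as an associative array.
    // $includeItems: if non-empty, keep only these property names.
    // $excludeItems: property names to drop from the result.
    public function toArray(array $includeItems = array(), array $excludeItems = array())
    {
        // Called from inside the class hierarchy, so protected properties
        // of the child model are visible here.
        $values = get_object_vars($this);

        if (!empty($includeItems)) {
            $values = array_intersect_key($values, array_flip($includeItems));
        }
        if (!empty($excludeItems)) {
            $values = array_diff_key($values, array_flip($excludeItems));
        }
        return $values;
    }
}

// Hypothetical model, used only to exercise the sketch.
class Activity extends BaseModel
{
    protected $activityTitle;
    protected $tags = array();
    protected $submittedBy;

    // Setters are the place to validate and typecast incoming values.
    public function setActivityTitle($title) { $this->activityTitle = (string) $title; }
    public function setSubmittedBy($userId)  { $this->submittedBy = (int) $userId; }
    public function addTag($tag)             { $this->tags[] = (string) $tag; }
}

$activity = new Activity();
$activity->setActivityTitle('This is example');
$activity->setSubmittedBy('23');
$activity->addTag('tag1');

$full        = $activity->toArray();                        // all properties
$partial     = $activity->toArray(array('activityTitle'));  // subset for a partial update
$withoutTags = $activity->toArray(array(), array('tags'));  // everything except tags
```

The array returned by toArray() is what would then be passed to the driver, and the include list keeps a partial update from touching fields it was never meant to change.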
That's pretty much it for our ODM