The Doeswijk Data Model
by Hu Yoshida on May 11, 2009
As part of my job, I try to find ways to communicate the benefits and challenges of technology and business models, so I am always on the lookout for more effective ways to do this.
Last week I visited a customer whose storage architect described their data and storage growth with the Doeswijk Data Model. He said that there are now three dimensions of data growth that have to be considered together. The first is primary production data growth, which applications people try to estimate as best they can. The second is growth in copies or replicas, which very few people outside of storage administration worry about. The third is retention data: static data that should be archived. Total data volume is the product of these three dimensions. Primary data requires copies for many purposes, such as backup, development and test, data mining, extract/transform/load, and data distribution. At some point the majority of primary data goes stale but needs to be retained in an archive, where backup is no longer required as long as there is at least one copy for recovery purposes. These dimensions can be placed on different cost and performance tiers.
It is important to think of data in these terms since it helps explain why we are always running out of storage. Many application users plan only for the production phase of their data and have no clue about the number of copies it takes to protect, analyze, and share that data. Most of them do not care what happens after the end of the productive period, since they are off to a new application and someone else has to worry about keeping or deleting the data they produced. A change in any of these dimensions has a multiplying effect on the total data volume. For instance, a cube with dimensions 2x2x2 has a volume of 8; if we change one dimension to 3, as in 3x2x2, the total volume becomes 12.
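The cube arithmetic above can be written out as a small Python sketch. The function name and the idea of treating copies and retention as plain multipliers on primary data are assumptions made for illustration; the model itself only says that total volume is the product of the three dimensions.

```python
def total_data_volume(primary, copy_factor, retention_factor):
    """Total data volume as the product of the three dimensions.

    All three parameter names are hypothetical labels for the cube's axes:
    primary production data, copies/replicas, and retention.
    """
    return primary * copy_factor * retention_factor


baseline = total_data_volume(2, 2, 2)  # the 2x2x2 cube
grown = total_data_volume(3, 2, 2)     # one dimension grows from 2 to 3

print(baseline)          # 8
print(grown)             # 12
print(grown / baseline)  # 1.5
```

Growing a single dimension by 50 percent grows the whole volume by 50 percent, which is the multiplying effect in a nutshell.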
Data resides in storage, so storage capacity can be represented by a cube that contains the data volume cubes. The dimensions of a storage capacity cube have a relation to the dimensions of the data volume cube. In some cases these dimensions might be tiers of storage which relate to corresponding data dimensions. Usually the storage capacity volume is bigger than the data volume, but with new technologies like de-dupe, thin provisioning, compression, copy on write, etc., storage capacity may be smaller than the data volume.
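As a rough sketch of how storage capacity can end up larger or smaller than the data volume, the Python below scales a data volume by an overhead factor and a reduction ratio. Both factors and all numbers are assumptions chosen for illustration, not figures from the model or from any product.

```python
def required_capacity_tb(data_volume_tb, overhead_factor=1.0, reduction_ratio=1.0):
    """Estimate storage capacity for a given data volume.

    overhead_factor: hypothetical multiplier for things like RAID, spares,
        and free-space headroom (1.25 means 25 percent extra).
    reduction_ratio: hypothetical combined effect of de-dupe, compression,
        thin provisioning, etc. (2.0 means data shrinks to half).
    """
    if overhead_factor <= 0 or reduction_ratio <= 0:
        raise ValueError("factors must be positive")
    return data_volume_tb * overhead_factor / reduction_ratio


data_volume_tb = 12  # for example, the 3x2x2 cube

# Without data reduction, capacity exceeds the data volume.
print(required_capacity_tb(data_volume_tb, overhead_factor=1.25))  # 15.0

# With an assumed 2:1 reduction, capacity drops below the data volume.
print(required_capacity_tb(data_volume_tb, overhead_factor=1.25, reduction_ratio=2.0))  # 7.5
```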
Maartin Doeswijk is the young Dutch storage architect who first showed me this model. Rather than use his model in my work and have people think that I devised it, I am calling it the Doeswijk Data Model.
Here is Doeswijk’s original version of this data model which I am publishing with his permission.
Comments (9)
Although some of Maartin's assumptions are correct, I cannot agree with this somewhat simplistic model.
What this model does not take into account is the value of the data and the amount that has to be retained and/or copied.
If I have X TB of production data, that by no means guarantees that all of this data needs to be copied for some reason and/or retained for a long period of time because of legislation or other requirements.
I wrote an article a long time ago (http://massstorage.blogspot.com/2007/02/future-of-storage.html) about the business value of data and how it should affect decisions on tiering, protection and retention, and make it possible to apply the all-encompassing hype that is Data Lifecycle Management.
The formulas Maartin uses in his model are 2x2x2 multiplications; however, when the model I use is applied, these metrics get different values based on the business requirements. The only way to achieve that is to quantify the business value of the data and apply it to one of those metrics. It then also becomes a four-dimensional calculation, which gets somewhat more complex.
Erwin van Londen
Yes, also Dutch.
BTW it is Maarten Doeswijk. Sorry for making the same mistake. The name sounded familiar, but that is because some of my former colleagues at HDS in Holland have a strong affiliation with Maarten in his role at ING.
Sorry, Maarten. I will pay closer attention next time.
Erwin van Londen
Very interesting. A good first cut for predicting growth and what kind of growth. Erwin's right, but I see this model as most useful as a directional starting place. The resulting specific plans for each axis will answer his questions.
Thanks for your comments. Erwin, please do not read too much into this model. It is simply a way to call attention to the fact that data is being copied many times over and more of this data is being retained. Many application developers give a projection for the amount of production data that will be generated but do not consider the storage required to hold the replicas and retention. Operations need to plan for this or they will always be running out of capacity. This needs to be considered when planning for the cost of a managed GB.
The 2x2x2 example was mine and was used only to show how replicas and retention capacity and costs can explode.
Pete Steege has taken the right perspective. It is a good first cut and is useful as a directional starting place. From there you can apply your business value analysis to decide what needs replicas and what needs retention. Data classification exercises usually address the cost tiers of storage for the primary volume without considering replicas and retention.
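To illustrate moving from the first cut to a per-class analysis, here is a minimal Python sketch. The class names, sizes, and factors are invented for the example, and treating each data class as its own small cube is an assumption, not something the model prescribes.

```python
from dataclasses import dataclass


@dataclass
class DataClass:
    """One class of data with its own (hypothetical) growth factors."""
    name: str
    primary_tb: float
    copy_factor: float       # 1 means no extra copies
    retention_factor: float  # 1 means nothing retained beyond production

    def volume_tb(self):
        return self.primary_tb * self.copy_factor * self.retention_factor


def total_volume_tb(classes):
    """Sum the per-class cubes instead of using one set of factors."""
    return sum(c.volume_tb() for c in classes)


classes = [
    DataClass("transactional", primary_tb=10, copy_factor=3, retention_factor=2),
    DataClass("scratch", primary_tb=5, copy_factor=1, retention_factor=1),
]

first_cut = 15 * 3 * 2              # same factors applied to all 15 TB
refined = total_volume_tb(classes)  # factors applied per class

print(first_cut)  # 90
print(refined)    # 65
```

The first cut gives the direction and the order of magnitude; the per-class pass shows where replicas and retention are not actually needed.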
What does HDS not get in the Data Domain bidding war?
It’s bad enough we missed the boat on Diligent.
I would like to get your take on the Dedupe market.
HDS FAN waiting for the big H to wake up!
A model will never exactly match reality. I was looking for a way to express and display the impact of three different factors that are responsible for storage growth. Obviously the duplication and retention axes will never be the same for each type of data or information. But I noticed that when you calculate averages, they seem to be rather consistent. Only major changes seem to affect them, and even then they can be predicted up to a certain level.
An added difficulty that you haven’t mentioned here is that the business value and requirements differ and change through time. Again you can try to factor all these effects in but it will get very complex very fast. (and try to explain that to a senior manager)
Hence I looked towards a kind of fuzzy-logic approach and kept it at a higher level, since certain effects can cancel each other out. I have seen this in the backup-restore area, and the factors there are very consistent.
Interesting article. Keeping data at the center and showing retention, growth and duplication on the three axes is understandable. But this model will not be used for practical purposes. For storage growth, duplication and retention will not be the same.
[...] The Doeswijk Data Model presents a wonderful way of thinking about storage capacity and growth. [...]
[...] the week by catching up with Maarten Doeswijk, the creator of the Doeswijk Data Model that I blogged about several years ago. He uses a cube to represent the three dimensions of data growth: primary data, replicas of the [...]