IMDB Engine and Database Questions
We have several questions regarding the IMDB engine and databases that are created/stored there.
- Is the memory required for a given database the same as the amount of disk space that the database requires? For example, if we process a model and the resulting .irdb file in the \Program Files\Pyramid 2018\repository\imdata directory is 100MB, does that mean that model requires 100MB of memory?
- Is there a general expectation for the compression ratio we might see for any given model based on the size of the source data (realizing that the compression achieved is likely a factor of the source data's column cardinality, data types, etc.)?
- When a model is being re-processed, can it still be queried during the processing operation?
- When a model is being re-processed, does the IMDB engine require twice as much memory for it (such that a new "version" of the model exists until the processing operation is complete)?
- When a model is processed, are its sources queried in parallel? If so, are tables from each source queried in parallel?
- Is there any way to process a single source or table within a model versus the entire model?
- Is there any way to process a model incrementally (based on source data partitions or separate queries, for example)?
Dave F.
1 reply
Hello Dave,
Sorry for the late response to this post, but I hope that some of those questions are still relevant.
1. No, the disk size and the memory size are not the same. The data is stored on disk in a more compressed form, so once it is read into memory it occupies more space than the .irdb file does on disk. To see this for yourself, go to Admin > Data > Source Manager, click on the IMDB server, and on the General tab you will find pie charts showing the disk size and the memory size of the top 10 models (see the attachments for more information).
2. As you said, the model size depends on the data types, cardinality, number of columns, character sets and more, so we can't give you a precise general estimate. A rough rule of thumb is that the in-memory model stored on disk is about 10% of the data footprint of the underlying relational database; once loaded into memory it is roughly 20%.
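For example, here is that rule of thumb applied as simple arithmetic (illustrative figures only, not a guarantee; actual ratios vary with the factors above):

```python
# Illustrative only: applies the ~10% disk / ~20% memory rule of thumb.
# Real ratios depend on cardinality, data types, character sets, etc.

def estimate_model_sizes(source_gb: float) -> tuple[float, float]:
    """Return rough (disk_gb, memory_gb) estimates for an IMDB model."""
    disk_gb = source_gb * 0.10    # stored .irdb file on disk
    memory_gb = source_gb * 0.20  # footprint once loaded into memory
    return disk_gb, memory_gb

disk, mem = estimate_model_sizes(100.0)  # 100 GB source database
print(f"~{disk:.0f} GB on disk, ~{mem:.0f} GB in memory")  # ~10 GB, ~20 GB
```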
3 + 4. When the model is reprocessed, we build the new data in batches on disk, and only when the build is complete do we swap it in for the existing database. This means that while the model is being built you can still query the old database, but you won't be able to run queries during the swap itself. The swap should be much quicker than building the model.
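Here is a minimal sketch of that build-then-swap pattern (hypothetical path and function names, not Pyramid's actual implementation): the new database is written in batches to a temporary file while the old one stays queryable, and a brief atomic replace then swaps it in.

```python
import os
import tempfile

# Sketch of the build-then-swap pattern described above (hypothetical
# names; not Pyramid's actual code). The old .irdb file keeps serving
# queries during the long build phase; only the final swap blocks them.

MODEL_PATH = r"C:\Program Files\Pyramid 2018\repository\imdata\model.irdb"

def reprocess(batches):
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(MODEL_PATH))
    with os.fdopen(fd, "wb") as tmp:
        for batch in batches:         # long build phase: old model still queryable
            tmp.write(batch)
    os.replace(tmp_path, MODEL_PATH)  # brief swap: queries are blocked only here
```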
5. When an ETL is executed, the source queries run in parallel (each thread streams data from a different table).
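As an illustration of that parallel-read pattern (hypothetical fetch_table helper and table names, not Pyramid's actual ETL code), one worker thread streams rows from each source table at the same time:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustration of one-thread-per-table parallel reads (hypothetical
# helper; not Pyramid's actual ETL implementation).

def fetch_table(table: str) -> list[str]:
    # Placeholder for e.g. running "SELECT * FROM <table>" on the source
    return [f"row from {table}"]

tables = ["customers", "orders", "order_items"]
with ThreadPoolExecutor(max_workers=len(tables)) as pool:
    rows_by_table = dict(zip(tables, pool.map(fetch_table, tables)))
```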
6. It is not currently possible to process a single source or table within a model; the entire model must be processed. I will add this as a new feature request.
7. In the latest versions of Pyramid, you can use variables to achieve incremental loading. For more information, see our help center:
https://help.pyramidanalytics.com/Content/Root/MainClient/apps/Model/Data%20Flow/Variables.htm
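To give a feel for the idea (a hypothetical sketch with illustrative names and query; see the help-center link above for Pyramid's actual variable syntax), an incremental load only pulls rows changed since the last run, with a variable supplying the cutoff:

```python
from datetime import datetime, timezone

# Hypothetical illustration of incremental loading via a variable
# (illustrative names; not Pyramid's actual variable mechanism).

last_load = datetime(2018, 1, 1, tzinfo=timezone.utc)  # persisted between runs

query = "SELECT * FROM orders WHERE updated_at > %s"   # variable fills the cutoff
params = (last_load,)
# ...execute the query against the source and append the new rows to the model...

last_load = datetime.now(timezone.utc)  # store for the next incremental run
```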
If you have any other questions, let me know; I will be happy to help.
Thanks,
Shai Ben Ami.