Understanding Tableau Data Extract (from https://www.tableau.com/about/blog/2014/7/understanding-tableau-data-extracts-part1?__src=liftigniter&__widget=blog-recs&li_source=LI&li_medium=blog-recs )
A Tableau data extract is a compressed snapshot of data stored on disk and loaded into memory as required to render a Tableau viz. There are two aspects of TDE design that make them ideal for supporting analytics and data discovery. The first is that a TDE is a columnar store.
Let’s establish the common understanding that columnar databases store column values together rather than row values. As a result, they dramatically reduce the input/output required to access and aggregate the values in a column. That’s what makes them so wonderful for analytics and data discovery.
Figure 1 - A columnar store makes it possible to quickly operate over the values in any given column
The second key aspect of TDE design is how they are structured which impacts how they are loaded into memory and used by Tableau. This is a very important part of how TDEs are “architecture aware”. Basically, architecture-awareness means that TDEs use all parts of your computer’s memory, from RAM to hard disk, and put each part to work as best fits its characteristics.
To better understand this aspect of TDEs, we’ll walk through how a TDE is created and then used as the data source for one or more visualizations.
When Tableau creates a data extract, it first defines the structure for the TDE and creates separate files for each column in the underlying source. (This is why it’s beneficial to minimize the number of data source columns selected for extract).
As Tableau retrieves data, it sorts, compresses and adds the values for each column to their respective file. With 8.2, the sorting and compression occur sooner in the process than in previous versions, accelerating the operation and reducing the amount of temporary disk space used for extract creation.
People often ask if a TDE is decompressed as it is being loaded into memory. The answer is no. The compression used to reduce the storage requirements of a TDE to make them more efficient is not file compression.
Rather, several different techniques are used, including dictionary compression (where common column values are replaced with smaller token values), run length encoding, a frame of reference encoding, and delta encoding. However, file compression can still be used to further reduce the size of a TDE if you’re planning to email or copy it to a remote location.
Figure 2 - Compression techniques are used to further optimize the TDE columnar store. Each column becomes a memory-mapped file in the TDE store
To complete the creation of a TDE, individual column files are combined with metadata to form a memory-mapped file — or to be more accurate, a single file containing as many individual memory-mapped files as there are the columns in the underlying data source. This is a key enabler of its carefully engineered architecture-awareness. (And even if the term is unfamiliar, you’ve come across memory-mapped files before. They are a feature of any modern operating system (OS).)
Because a TDE is a memory-mapped file, when Tableau requests data from a TDE, the data is loaded directly into memory by the operating system. Tableau doesn’t have to open, process or decompress the TDE to start using it. If necessary, the operating system continues to move data in and out of RAM to insure that all of the requested data is made available to Tableau. This is a key point - it means that Tableau can query data that is bigger than the available RAM on a machine!
Only data for the columns that have been requested is loaded into RAM. However, there are also some other subtler optimizations. For example, a typical OS-level optimization is to recognize when access to data in a memory-mapped file is contiguous and as a result, read ahead in order to increase speed access. Memory-mapped files are also only loaded once by an OS, no matter how many users or visualizations access it. Since it isn’t necessary to load the entire contents of TDEs into memory for them to be used, the hardware requirements and so the costs of a Tableau Server deployment are kept reasonable.
Lastly, architecture-awareness does not stop with memory – TDEs support the Mac OS X and Linux OS in addition to Windows, and are 32- and 64-bit cross-compatible. It doesn’t get much better than that for a fast, in-memory data engine. If you’re interested, you can read about other important breakthrough technologies in Tableau link provided at the top of the page.