Retrieve Data at High Speed
Processing statistical data quickly is essential for delivering the business decisions in real-time. This chapter describes several fine-tuning issues to consider when using the statistical queries, provided by Machbase.
- Optimize materialization
- Check whether to use index or not
- Reduce the range of data search using time range
Table of Contents
- Changing the Size of Hash Bucket
- Reduce Time Range via DURATION Keyword
- Using EXPLAIN PLAN
- Settings of Column MIN/MAX Cache
Changing the Size of Hash Bucket
In the case of processing queries which include GROUP BY clause, DISTINCT query, or COUNT (distinct column) clause, Machbase generates hash tables. The size of the hash table is 20011 by default and it can specify up to 100MB.
The size of hash bucket that belongs to a specific session can be changed using the ALTER SESSION statement.
ALTER SESSION SET hash_bucket_size = [desired hash table size];
To change hash_bucket_size for the server, add the following command to the $MACHBASE_HOME/conf/machbase.conf.
HASH_BUCKET_SIZE = [desired hash table size]
To have better performance, it is recommended to set the size of the hash table as 1.5 times more than the number of groups generated in GROUP BY clause. If you do not know the number of the group, check the value of MTR_HASH_FIND_CONFLICT in the V$SYSSTAT table. If the value is larger than the bucket size, increase the size of hash_bucket_size.
SELECT value FROM v$sysstat WHERE name = 'MTR_HASH_FIND_CONFLICT';
If the value above is 0, there is no hash collision resulting in maximum performance. If it is not 0, it shows the number of hash collisions. If you set the property of HASH_BUCKET_SIZE property too large, it can use too much memory when you perform the query.
Reduce Time Range via DURATION Keyword
Machbase provides query search which is optimized for time series data. Each log table has _ARRIVAL_TIME column that has time value of data entry and it stores the time value of entry point.
Search Data for Specific Period Based on Current Time.
To search data for a specified time period based on the current time, use DURATION keyword.
DURATION number [YEAR | MONTH | DAY | HOUR | MINUTE | DAY | SECOND]
The statement above shows that it searches data based on time, in this case, 1 second means from a second ago to the present time, 5 minutes means from 5 minutes ago to the present time, 7 days means from 7 days ago to the present time.
Search Data for Specific Period Based on Specific Time
To search data based on the particular time, use the BEFORE and AFTER keywords. To specify a particular time, use the TO_DATE function to describe the desired time in character type, and then convert it into time data type.
To search data for one day based on the specific time, Jan 2, 2015 from Jan 1 to Jan 2, use BEFORE keyword as follows.
DURATION 1 DAY BEFORE TO_DATE('2015-01-02', 'YYYY-MM-DD')
To search data for 7 days after Jan 1, 2015, use the AFTER keyword. If you use this keyword, the data shows up in the chronological order (from past to present).
DURATION 7 DAY AFTER TO_DATE('2015-01-01 00:00:00', 'YYYY-MM-DD HH24:MI:SS')
Machbase displays data from present to past by default. However, if you want to display data in reverse order (from past to present), use DURATION clause and AFTER keyword. It is much faster than using ORDER BY clause when it comes to the _ARRIVAL_TIME column. It also saves memory usage.
To get all data to be displayed from the past to the present, use
DURATION ALL AFTER TO_DATE ('1970-01-01' , 'YYYY-MM-DD').
ALL is used if period is not specified.
Search Data for a Specified Duration
To search records from a specific time to another specific time, e.g., from June 12 (13:00) to June 12 (20:00) in 2015, use DURATION FROM... TO clause. The data are displayed from the past to present if the AFTER conditional clause is used.
DURATION FROM TO_DATE('2014-6-12 13:00:00', 'YYYY-MM-DD HH24:MI:SS') TO TO_DATE('2014-6-12 20:00:00', 'YYYY-MM-DD HH24:MI:SS');
In the example below, the beginning time and ending time are not set from the past to recent. Thus, data will be displayed in reverse order, from the recent to past unlike the example above.
DURATION FROM TO_DATE('2014-6-12 20:00:00', 'YYYY-MM-DD HH24:MI:SS') TO TO_DATE('2014-6-12 13:00:00', 'YYYY-MM-DD HH24:MI:SS');
Using EXPLAIN PLAN
To check how the query plan is changed, use the EXPLAIN PLAN function, which is an machsql utility function and cannot be used in ODBC/CLI.
Display the results by executing EXPLAIN command against the query below.
Mach> EXPLAIN SELECT i1 FROM test_table WHERE i1 = 1; PLAN ------------------------------------------------------------------------------------ QPX_NODE_TYPE_PROJ *BITMAP (t:301, c:1, i:302)
This query plan has PROJ node as a top-level node and you can see the BITMAP index scan as sub node. (t is table ID, c is column ID, and i is index ID.) Below are descriptions of each node.
- Scan node: It is represented by the index scan or progressive scan node. When using more than two indexes, each index node will be displayed.
- Index scan: Types of index (BITMAP and KEYWORD) are displayed. If you search the index by RANGE clause (bitmap index only), RANGE mark is added.
- Progressive scan: It is shown as FULL SCAN, and examines the conditional clause by reading the entire column values within the specified range.
- Aggregation node: QPX_NODE_TYPE_GRAG is displayed. It processes set operations.
- Having node: QPX_NODE_TYPE_HAVING is displayed. It processes the HAVING clause.
- Sort node: QPX_NODE_TYPE_SORT is displayed. ORDER By clause appears on a given query.
- COUNTSTAR node: If you only use count (*) without any conditional clause, it is used for quick process.
- COUNTSTAR_WITH_TIME node: It is an execution node to process the query quickly using count(*) with DURATION clause.
- INNER JOIN node: It's the execution node that processes a normal join. It's displayed as QPX_NODE_TYPE_JOIN.
- OUTER JOIN node: It's an execution node that processes the outer join. It's displayed as QPX_NODE_TYPE_OUTERJOIN.
- INLINEVIEW node: It's an execution node that processes the inline view. It's displayed as QPX_NODE_TYPE_INLINEVIEW.
- PROJNODE: This is a top-level execution node that deals with the output of query.
Settings of Column MIN/MAX Cache
If it searches a large amount of data, from tens of millions to hundreds of millions of data, it is required to access tens of thousands partitions of the Machbase. Therefore, it is efficient to search the partitions that have the specific values rather than checking the entire partitions. For this, Machbase sets and keeps the values of MIN and MAX against columns of table partitions into memory and disk. And it uses the values for a high performance search.
When MIN/MAX cache is properly used, you can expect up to dozens of times of improvement in analysis and performance. You can find more information on MINMAX Cache from Table section.
If you maintain the values of MIN and MAX in the memory, it is sharing the areas which are set as the properties of DISK_COLUMN_TABLESPACE_MEMORY_MAX_SIZE. If MIN and MAX cache values are larger than the memory size, distribute it to multiple disks and store them.
Below is an example of setting the MIN/MAX cache properties.
Creating a table by setting MIN/MAX Cache
# If the value of MINMAX_CACHE_SIZE of the column is larger than 0, the specified value of the MINMAX can be saved in the cache. CREATE TABLE ctest5 (id1 integer property(MINMAX_CACHE_SIZE=100000), name varchar(100) property(MINMAX_CACHE_SIZE=0), id2 long property(MINMAX_CACHE_SIZE=100000), id3 ipv4 property(MINMAX_CACHE_SIZE=100000), id4 short);
The given MIN/MAX values are memory areas that maintain the cache in bytes. If the type is LONG (8 bytes) and the size of cache is 100,000 bytes, it's about 100m000 / 8 (the size of type) * 2(the values of MIN and MAX respectively). In other words, 5000 partitions are stored in the memory.
Changing MIN/MAX Cache after creating the table
# In the case of setting the MINMAX property in id4 column ALTER TABLE ctest5 MODIFY COLUMN id4 SET MINMAX_CACHE_SIZE = 1000000;