Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.
Sign upGitHub is where the world builds software
Millions of developers and companies build, ship, and maintain their software on GitHub — the largest and most advanced development platform in the world.
HIVE-24388: Enhance swo optimizations to merge EventOperators #1750
Conversation
…re dump and custom configs commit 7ad4c9d Author: Stamatis Zampetakis <zabetak@gmail.com> Date: Mon Jun 15 21:09:14 2020 +0200 HIVE-23965: Improve plan regression tests using TPCDS30TB metastore dump and custom configs 1. Add new perf driver, TestTezTPCDS30TBCliDriver, relying on a dockerized metastore. 2. Use Dockerized postgres metastore with TPC-DS 30TB dump 3. Remove old drivers (with and without constraints), related classes (e.g., MetastoreDumpUtility), and resources. After discussion in the JIRA we decided that keeping the old drivers is most likely useless. 4. Use Hive config properties obtained and curated from real-life usages 5. Allow AbstractCliConfig to override metastore DB type 6. Rework CorePerfCliDriver to allow pre-initialized metastores Remove system property settings in the initialization of the driver and leave in the configuration to set it up if needed. This is necessary to be able to use the driver with a preinitialised metastore. Remove redundant logs in System.err. Logging and throwing an exception is an anti-pattern. Replace assertions with exceptions and improve the messages. 7. Upgrade postgres JDBC driver to version 42.2.14 to be compatible with the docker image used 8. Update path to new postgres version in TestBeelineArgParsing 9. Display plans using hive.explain.user set to false which seems to be preferrable for performing a diff and checking regressions 10. Disable queries 14 (HIVE-24167), 30 (HIVE-23964) 11. Re-enable CBO plan tests for queries 44, 45, 67, 70, 86 The queries were disabled as part of HIVE-20718. They were supposed to be fixed in Calcite 1.18.0 and currently Hive is in 1.21.0 so it is not surprising that they pass. 12. Add missing queries: cbo_query41, cbo_query62, query62 commit 4ee5336 Author: Stamatis Zampetakis <zabetak@gmail.com> Date: Tue Nov 17 11:13:22 2020 +0100 HIVE-24395: Intermittent failures to initialize dockerized Postgres metastore in tests Ensure the Postgres init process is complete the database port is open and accepting connections before declaring the container ready. commit a2815c2 Author: Stamatis Zampetakis <zabetak@gmail.com> Date: Tue Jun 16 22:17:10 2020 +0200 HIVE-23742: Remove unintentional execution of TPC-DS query39 in qtests
… metastore dump and custom configs" This reverts commit 5576d77.
…ilter out inbvalid cases" This reverts commit 3d74bb5.
…e vill filter out inbvalid cases"" This reverts commit 9d0dd60.
This reverts commit 80fa406.
|
@jcamachor could you please take a look? |
| @@ -17,6 +17,7 @@ | |||
| */ | |||
| package org.apache.hadoop.hive.ql.optimizer; | |||
|
|
|||
| import java.io.File; | |||
jcamachor
Dec 9, 2020
Contributor
Needed?
Needed?
kgyrtkirk
Dec 10, 2020
Author
Member
yeah...I've some stuff which write out File-s for now...and I just remove it at the end of preparing the patch....
it gives easier insights into what happened in the SWO.
I would really like to build it into the system and expose it on the HS2 web interface - it would be very usefull...will get to that as well :)
but first I would like to finish this set of patches :)
yeah...I've some stuff which write out File-s for now...and I just remove it at the end of preparing the patch....
it gives easier insights into what happened in the SWO.
I would really like to build it into the system and expose it on the HS2 web interface - it would be very usefull...will get to that as well :)
but first I would like to finish this set of patches :)
| Filter Operator [FIL_68] | ||
| predicate:ss_sold_date_sk is not null | ||
| TableScan [TS_0] (rows=123457 width=14) | ||
| default@x1_store_sales,s,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk"] |
jcamachor
Dec 9, 2020
Contributor
Could merging these TS operators be effectively worse?
For instance, in this specific mock query, no partition will be pruned for the x1_store_sales table, while before partition pruning was kicking in. Thus, in this case, you are scanning the same data whether you have one or two TS operators (two partitions), however after merging the TS, the size of the data you are shuffling for the join doubles (data in both partitions twice)? Is that analysis correct?
Off the top of my head, this could be beneficial if i) both TS only select a small subset of the partitions in the table, or ii) overlapping in the partition list for those two different TS is greater than a certain threshold. Should we work in that direction, i.e., introduce some config parameters for this?
@rbalamohan , what is your take? It would be helpful to have a second opinion.
Could merging these TS operators be effectively worse?
For instance, in this specific mock query, no partition will be pruned for the x1_store_sales table, while before partition pruning was kicking in. Thus, in this case, you are scanning the same data whether you have one or two TS operators (two partitions), however after merging the TS, the size of the data you are shuffling for the join doubles (data in both partitions twice)? Is that analysis correct?
Off the top of my head, this could be beneficial if i) both TS only select a small subset of the partitions in the table, or ii) overlapping in the partition list for those two different TS is greater than a certain threshold. Should we work in that direction, i.e., introduce some config parameters for this?
@rbalamohan , what is your take? It would be helpful to have a second opinion.
kgyrtkirk
Dec 10, 2020
Author
Member
I think right now we don't have a sanity check filter right before the TS to limit reduce shuffled data size to the previous amount; but reducing the number of scans is beneficial - IIRC in the q23 query the scanned partitions were the same; so the benefit was real.
If we make available these things the same way as we have the SJ filters - then we could for sure avoid shuffling more data.
Note: I think it would enable some further opportunities if we would change the SJ data transmission method from RS to EVENTOP - that way we shouldn't have to worry about parallel edges anymore...
I think right now we don't have a sanity check filter right before the TS to limit reduce shuffled data size to the previous amount; but reducing the number of scans is beneficial - IIRC in the q23 query the scanned partitions were the same; so the benefit was real.
If we make available these things the same way as we have the SJ filters - then we could for sure avoid shuffling more data.
Note: I think it would enable some further opportunities if we would change the SJ data transmission method from RS to EVENTOP - that way we shouldn't have to worry about parallel edges anymore...
rbalamohan
Dec 10, 2020
Contributor
however after merging the TS, the size of the data you are shuffling for the join doubles
Yes, this is possible with merging. +1 on adding an option for enabling it.
There is additional option to reduce number of group by operator evaluations (e.g Q65 computes group by operator twice) which can help in reducing runtime.
however after merging the TS, the size of the data you are shuffling for the join doubles
Yes, this is possible with merging. +1 on adding an option for enabling it.
There is additional option to reduce number of group by operator evaluations (e.g Q65 computes group by operator twice) which can help in reducing runtime.


What changes were proposed in this pull request?
Why are the changes needed?
Does this PR introduce any user-facing change?
How was this patch tested?