close
The Wayback Machine - https://web.archive.org/web/20201211181507/https://github.com/apache/hive/pull/1750
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-24388: Enhance swo optimizations to merge EventOperators #1750

Merged
merged 22 commits into from Dec 11, 2020

Conversation

@kgyrtkirk
Copy link
Member

@kgyrtkirk kgyrtkirk commented Dec 7, 2020

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

kgyrtkirk added 17 commits Nov 4, 2020
(cherry picked from commit fbfe280)
…re dump and custom configs

commit 7ad4c9d
Author: Stamatis Zampetakis <zabetak@gmail.com>
Date:   Mon Jun 15 21:09:14 2020 +0200

    HIVE-23965: Improve plan regression tests using TPCDS30TB metastore dump and custom configs

    1. Add new perf driver, TestTezTPCDS30TBCliDriver, relying on a dockerized metastore.
    2. Use Dockerized postgres metastore with TPC-DS 30TB dump
    3. Remove old drivers (with and without constraints), related classes
    (e.g., MetastoreDumpUtility), and resources.

    After discussion in the JIRA we decided that keeping the old drivers
    is most likely useless.

    4. Use Hive config properties obtained and curated from real-life usages
    5. Allow AbstractCliConfig to override metastore DB type
    6. Rework CorePerfCliDriver to allow pre-initialized metastores

    Remove system property settings in the initialization of the driver and
    leave in the configuration to set it up if needed. This is necessary to
    be able to use the driver with a preinitialised metastore.

    Remove redundant logs in System.err. Logging and throwing an exception
    is an anti-pattern.

    Replace assertions with exceptions and improve the messages.

    7. Upgrade postgres JDBC driver to version 42.2.14 to be compatible
    with the docker image used
    8. Update path to new postgres version in TestBeelineArgParsing
    9. Display plans using hive.explain.user set to false which seems
    to be preferrable for performing a diff and checking regressions

    10. Disable queries 14 (HIVE-24167), 30 (HIVE-23964)
    11. Re-enable CBO plan tests for queries 44, 45, 67, 70, 86

    The queries were disabled as part of HIVE-20718. They were supposed to
    be fixed in Calcite 1.18.0 and currently Hive is in 1.21.0 so it is not
    surprising that they pass.

    12. Add missing queries: cbo_query41, cbo_query62, query62

commit 4ee5336
Author: Stamatis Zampetakis <zabetak@gmail.com>
Date:   Tue Nov 17 11:13:22 2020 +0100

    HIVE-24395: Intermittent failures to initialize dockerized Postgres metastore in tests

    Ensure the Postgres init process is complete the database port is open
    and accepting connections before declaring the container ready.

commit a2815c2
Author: Stamatis Zampetakis <zabetak@gmail.com>
Date:   Tue Jun 16 22:17:10 2020 +0200

    HIVE-23742: Remove unintentional execution of TPC-DS query39 in qtests
… metastore dump and custom configs"

This reverts commit 5576d77.
…e vill filter out inbvalid cases""

This reverts commit 9d0dd60.
@kgyrtkirk kgyrtkirk changed the title HIVE-24388: Event merge HIVE-24388: Enhance swo optimizations to merge EventOperators Dec 7, 2020
kgyrtkirk added 2 commits Dec 7, 2020
@kgyrtkirk
Copy link
Member Author

@kgyrtkirk kgyrtkirk commented Dec 8, 2020

@jcamachor could you please take a look?
this patch also closes down 2 smaller bugs - one of them might have potentially caused parallel edge creation; I think the issue was not new; the new algorithm just stressed the issue a little bit more...I will ask Sungwoo to check it again after we have this patch

@@ -17,6 +17,7 @@
*/
package org.apache.hadoop.hive.ql.optimizer;

import java.io.File;

This comment has been minimized.

@jcamachor

jcamachor Dec 9, 2020
Contributor

Needed?

This comment has been minimized.

@kgyrtkirk

kgyrtkirk Dec 10, 2020
Author Member

yeah...I've some stuff which write out File-s for now...and I just remove it at the end of preparing the patch....
it gives easier insights into what happened in the SWO.

I would really like to build it into the system and expose it on the HS2 web interface - it would be very usefull...will get to that as well :)
but first I would like to finish this set of patches :)

Filter Operator [FIL_68]
predicate:ss_sold_date_sk is not null
TableScan [TS_0] (rows=123457 width=14)
default@x1_store_sales,s,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk"]

This comment has been minimized.

@jcamachor

jcamachor Dec 9, 2020
Contributor

Could merging these TS operators be effectively worse?

For instance, in this specific mock query, no partition will be pruned for the x1_store_sales table, while before partition pruning was kicking in. Thus, in this case, you are scanning the same data whether you have one or two TS operators (two partitions), however after merging the TS, the size of the data you are shuffling for the join doubles (data in both partitions twice)? Is that analysis correct?

Off the top of my head, this could be beneficial if i) both TS only select a small subset of the partitions in the table, or ii) overlapping in the partition list for those two different TS is greater than a certain threshold. Should we work in that direction, i.e., introduce some config parameters for this?

@rbalamohan , what is your take? It would be helpful to have a second opinion.

This comment has been minimized.

@kgyrtkirk

kgyrtkirk Dec 10, 2020
Author Member

I think right now we don't have a sanity check filter right before the TS to limit reduce shuffled data size to the previous amount; but reducing the number of scans is beneficial - IIRC in the q23 query the scanned partitions were the same; so the benefit was real.

If we make available these things the same way as we have the SJ filters - then we could for sure avoid shuffling more data.

Note: I think it would enable some further opportunities if we would change the SJ data transmission method from RS to EVENTOP - that way we shouldn't have to worry about parallel edges anymore...

This comment has been minimized.

@kgyrtkirk

kgyrtkirk Dec 10, 2020
Author Member

@jcamachor I've added a config knob to control event operator merge

This comment has been minimized.

@rbalamohan

rbalamohan Dec 10, 2020
Contributor

however after merging the TS, the size of the data you are shuffling for the join doubles

Yes, this is possible with merging. +1 on adding an option for enabling it.

There is additional option to reduce number of group by operator evaluations (e.g Q65 computes group by operator twice) which can help in reducing runtime.

@kgyrtkirk kgyrtkirk merged commit 19324b0 into apache:master Dec 11, 2020
2 checks passed
2 checks passed
Travis CI - Pull Request Build Passed
Details
continuous-integration/jenkins/pr-merge This commit looks good
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.