Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] RowReaderImpl::next return inconsistant data in certain case #1512

Open
chenpingzeng opened this issue May 22, 2023 · 3 comments
Open

Comments

@chenpingzeng
Copy link

chenpingzeng commented May 22, 2023

Background info:
In a spark project, we are using orc c++ as acceleration lib to access hdfs files, comparing to original spark table scan with java/scala code.
We found some tpcds-99 sql run fail with data inconstant, occasionally, by now SQL-9/SQL-28/SQL-47 encounters the problem in different clusters.

Appearance:
An example as below, 3TB data generated with tpcds-tool, when executing with 'select ss_sold_time_sk, ss_item_sk, ss_customer_sk, ss_cdemo_sk, ss_hdemo_sk, ss_addr_sk, ss_store_sk, ss_promo_sk, ss_ticket_number, ss_quantity, ss_wholesale_cost, ss_list_price, ss_sales_price, ss_ext_discount_amt, ss_ext_sales_price, ss_ext_wholesale_cost, ss_ext_list_price, ss_ext_tax, ss_coupon_amt, ss_net_paid, ss_net_paid_inc_tax, ss_net_profit, ss_sold_date_sk from store_sales where ss_item_sk=302314 and ss_customer_sk=11587351 and ss_quantity=24;',we got result as below:
image
The probleam here is ss_list_price/ss_ext_sales_price actual value is 15.81/7.44, but we get NULL

Analize Info:

  1. It is difficult to add some debug info for a table with over 8 billion records in the issue. we first got a way to check the return data of RowReaderImpl::next and found that data.data() were correct value, but some notNull.data()[i] with 0.

  2. we then add more debug code in orc c++ source code, below are log fragments:
    image
    a. at line 1132529 of log file, notNull.data() was init with all 1 at the beginning, here can see 4096 elements one time
    b. at line 1133878 of log file, reading a new stripe for last_total=0, shows now we are using a new ColumReader which created in RowReaderImpl::startNextStripe()
    c. in 1135312 of log file, shows the end of stripe reading
    d. in 1135372 of log file, shows we are reading a new stripe which has 129 elements/records

the record matches condition 'where ss_item_sk=302314 and ss_customer_sk=11587351 and ss_quantity=24' was one of this 129 records, with a/b/c/d,we can make sure that notNull.data() should contain none 0 values, but it do have.

After more deeper annalize, the probem was found in ColumnReader::next, in case batch elements of last stripe include some NULL values, there were notNull.data()[i] with 0 at some positions.
When reading a new stripe which include no one NULL values, notNull.data() were not initialized, but with polluted values.

After simply add std::memset(notNull.data(), 1, numValues); for ColumnReader::next(ColumnVectorBatch& rowBatch, uint64_t numValues, char* incomingMask), the problem solved.

@dongjoon-hyun
Copy link
Member

Thank you for reporting, @chenpingzeng .

cc @wgtmac , @stiga-huang

@wgtmac
Copy link
Member

wgtmac commented May 23, 2023

Hi @chenpingzeng, https://github.com/apache/orc/pull/1469/files has discussed the same thing.

BTW, the comment seems incorrect. I have opened https://github.com/apache/orc/pull/1515/files to fix it.

@chenpingzeng
Copy link
Author

chenpingzeng commented May 23, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants