Skip to content

[bug](iceberg) fix can't get migrated Iceberg tables format type#64134

Open
zhangstar333 wants to merge 2 commits into
apache:masterfrom
zhangstar333:iceberg_format_name
Open

[bug](iceberg) fix can't get migrated Iceberg tables format type#64134
zhangstar333 wants to merge 2 commits into
apache:masterfrom
zhangstar333:iceberg_format_name

Conversation

@zhangstar333
Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Problem Summary:

[CORRUPTION]Invalid magic number in parquet file, bytes read: 253, file size: 253,
path: /user/hive/warehouse/test_migrate_managed_...,
read magic: ORC .

The migrated Iceberg table properties don't have "write-format" or "write.format.default".
so doris use the default type of parquet as the table format. but it's actual a ORC type.
now add more check to infer the table format type.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@zhangstar333
Copy link
Copy Markdown
Contributor Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one blocking correctness issue in the new fallback path. The PR goal is clear and the main property-order change is small, but the fallback must not silently continue with Parquet when manifest/data-file inspection fails, because that can reproduce the same ORC-as-Parquet corruption path and can also make write/delete planning choose an incorrect format.

Critical checkpoint conclusions:

  • Goal/test: The change targets migrated Iceberg tables without write-format/write.format.default, but no regression/unit/manual test evidence is provided for ORC migrated tables or inference failure paths.
  • Scope: The final code is focused on getFileFormat, but the fallback changes error semantics.
  • Concurrency/lifecycle/config/compatibility: No new shared mutable state, lifecycle management, config, or storage-format compatibility issue found.
  • Parallel paths: getFileFormat is used by scan, insert, delete, merge, compression, and commit metadata paths; the silent Parquet fallback affects all of them.
  • Error handling: Blocking issue below; inference failure is converted to Parquet instead of failing with context.
  • Data correctness: The fallback can select the wrong physical file format after a real planning/IO/auth failure.
  • Performance/observability: planFiles may add extra planning IO for tables without format properties; logs are present, but they are not enough if execution continues with the wrong format.
  • Tests/results: No test changes or result updates in the PR.

User focus: No additional user-provided review focus was specified.

@zhangstar333
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/52) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 29458 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 0885132ee48b927f77728843b58f81af90d5fb76, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17685	4263	4187	4187
q2	q3	10777	1393	866	866
q4	4681	481	342	342
q5	7667	887	607	607
q6	181	185	138	138
q7	810	844	655	655
q8	9354	1537	1621	1537
q9	5791	4508	4520	4508
q10	6776	1808	1552	1552
q11	442	285	257	257
q12	631	430	302	302
q13	18140	3387	2764	2764
q14	270	261	240	240
q15	q16	818	768	709	709
q17	909	934	903	903
q18	7025	5862	5535	5535
q19	1300	1206	1051	1051
q20	513	415	271	271
q21	6273	2871	2717	2717
q22	468	392	317	317
Total cold run time: 100511 ms
Total hot run time: 29458 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5100	4753	4782	4753
q2	q3	4834	5311	4764	4764
q4	2137	2207	1397	1397
q5	4736	4931	4630	4630
q6	230	179	129	129
q7	1938	1825	1568	1568
q8	2435	2087	2120	2087
q9	7950	7704	7339	7339
q10	4714	4674	4248	4248
q11	533	382	357	357
q12	734	750	529	529
q13	3041	3315	2799	2799
q14	277	282	253	253
q15	q16	669	699	612	612
q17	1285	1263	1247	1247
q18	7164	6790	6925	6790
q19	1110	1083	1106	1083
q20	2232	2204	1964	1964
q21	5281	4591	4443	4443
q22	513	472	415	415
Total cold run time: 56913 ms
Total hot run time: 51407 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 169699 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 0885132ee48b927f77728843b58f81af90d5fb76, data reload: false

query5	4313	641	489	489
query6	434	196	178	178
query7	4828	566	298	298
query8	364	216	205	205
query9	8771	4090	4063	4063
query10	448	320	258	258
query11	5928	2333	2256	2256
query12	155	105	100	100
query13	1255	609	414	414
query14	6424	5462	5092	5092
query14_1	4429	4445	4376	4376
query15	207	199	174	174
query16	1022	450	446	446
query17	1112	676	565	565
query18	2566	482	334	334
query19	195	185	143	143
query20	110	107	106	106
query21	212	137	119	119
query22	13697	13623	13408	13408
query23	17429	16536	16039	16039
query23_1	16306	16307	16393	16307
query24	7559	1808	1317	1317
query24_1	1311	1306	1360	1306
query25	548	452	378	378
query26	1319	313	166	166
query27	2633	533	324	324
query28	4460	2026	2017	2017
query29	1061	608	481	481
query30	313	232	199	199
query31	1114	1101	967	967
query32	99	57	57	57
query33	515	324	246	246
query34	1175	1105	665	665
query35	764	790	672	672
query36	1402	1426	1240	1240
query37	149	102	94	94
query38	3216	3123	3051	3051
query39	933	930	899	899
query39_1	876	877	894	877
query40	220	127	100	100
query41	66	63	65	63
query42	95	94	98	94
query43	325	329	283	283
query44	
query45	192	188	182	182
query46	1123	1190	765	765
query47	2354	2352	2247	2247
query48	407	431	278	278
query49	625	474	362	362
query50	971	350	257	257
query51	4357	4293	4255	4255
query52	87	88	76	76
query53	241	266	198	198
query54	268	215	192	192
query55	78	76	70	70
query56	238	258	213	213
query57	1446	1407	1327	1327
query58	239	239	212	212
query59	1578	1642	1458	1458
query60	287	247	224	224
query61	151	154	152	152
query62	700	657	580	580
query63	232	182	185	182
query64	2547	775	632	632
query65	
query66	1757	464	338	338
query67	29638	29704	29664	29664
query68	
query69	416	316	256	256
query70	1000	958	940	940
query71	290	220	211	211
query72	2961	2791	2571	2571
query73	814	783	433	433
query74	5136	4973	4765	4765
query75	2690	2576	2255	2255
query76	2328	1218	815	815
query77	363	376	305	305
query78	12554	12295	11901	11901
query79	1312	1093	790	790
query80	639	496	422	422
query81	447	283	246	246
query82	568	159	127	127
query83	375	286	267	267
query84	267	155	122	122
query85	976	647	424	424
query86	363	289	270	270
query87	3370	3395	3201	3201
query88	3710	2751	2733	2733
query89	418	390	333	333
query90	1908	184	186	184
query91	174	168	134	134
query92	63	64	56	56
query93	1433	1463	884	884
query94	562	354	323	323
query95	675	473	350	350
query96	1069	775	339	339
query97	2709	2718	2583	2583
query98	214	206	205	205
query99	1169	1173	1041	1041
Total cold run time: 251295 ms
Total hot run time: 169699 ms

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants