bongjunj opened issue #12086:
Hi, this is a follow-up from #cranelift > Deoptimizing ISLE rules?.
As mentioned in the Zulip topic above, I suspect these ISLE rules can degrade performance, since they can increase the number of instructions and thus the overall computation cost.

I evaluated the impact of the rules on the sightglass benchmarks.
Starting from the main branch of Cranelift, I removed the rules to create a `no-demorgan` version, and compared execution and compilation performance, measured in CPU cycles, using `sightglass-cli`.
The averages over 10 repetitions are presented in the table below.

Removing the rules gives a 22.82% execution speedup and lowers compilation cost by 13.34% for `shootout-keccak`; the impact is negligible for the other benchmarks.

`speedup = (main - nodemorgan) / nodemorgan`
`overhead = (nodemorgan - main) / main`

The rules exist for normalization: they push `bnot` instructions down the tree so that other simplification rules and GVN can exploit the result further.
However, this data suggests the normalization is under-exploited and can degrade performance for `keccak`.

| Benchmark | Execution (main) | Execution (no-demorgan) | Speedup | Compilation (main) | Compilation (no-demorgan) | Overhead |
|---|---:|---:|---:|---:|---:|---:|
| blake3-scalar | 821,287 | 820,526 | 0.09% | 334,025,595 | 335,494,881 | 0.44% |
| blake3-simd | 904,391 | 902,968 | 0.16% | 215,000,060 | 216,887,637 | 0.88% |
| bz2 | 123,375,319 | 123,972,237 | -0.48% | 323,600,329 | 325,059,640 | 0.45% |
| pulldown-cmark | 7,527,276 | 7,528,312 | -0.01% | 685,357,632 | 687,241,047 | 0.27% |
| regex | 287,113,532 | 287,013,209 | 0.03% | 1,623,122,606 | 1,628,150,390 | 0.31% |
| shootout-ackermann | 7,766,207 | 7,769,915 | -0.05% | 98,442,831 | 99,049,461 | 0.62% |
| shootout-base64 | 377,876,186 | 377,986,725 | -0.03% | 94,119,875 | 94,616,896 | 0.53% |
| shootout-ctype | 796,212,604 | 796,195,661 | 0.00% | 90,769,795 | 90,728,785 | -0.05% |
| shootout-ed25519 | 11,062,252,786 | 11,041,529,973 | 0.19% | 505,160,708 | 511,230,333 | 1.20% |
| shootout-fib2 | 2,991,817,344 | 2,991,783,776 | 0.00% | 67,992,267 | 68,207,110 | 0.32% |
| shootout-gimli | 5,143,297 | 5,153,384 | -0.20% | 5,846,157 | 5,843,358 | -0.05% |
| shootout-heapsort | 2,374,978,997 | 2,375,615,158 | -0.03% | 29,560,353 | 29,690,677 | 0.44% |
| shootout-keccak | 48,797,823 | 39,731,401 | 22.82% | 292,241,108 | 253,254,413 | -13.34% |
| shootout-matrix | 697,653,531 | 697,060,503 | 0.09% | 93,389,415 | 93,786,432 | 0.43% |
| shootout-memmove | 37,572,864 | 37,679,507 | -0.28% | 95,438,341 | 95,867,972 | 0.45% |
| shootout-minicsv | 1,239,552,532 | 1,241,534,405 | -0.16% | 15,630,009 | 15,681,735 | 0.33% |
| shootout-nestedloop | 645 | 621 | 3.93% | 66,921,453 | 67,062,411 | 0.21% |
| shootout-random | 439,552,157 | 439,582,225 | -0.01% | 67,809,477 | 68,091,002 | 0.42% |
| shootout-ratelimit | 50,251,247 | 50,384,183 | -0.26% | 92,983,888 | 92,922,059 | -0.07% |
| shootout-seqhash | 15,249,759,981 | 15,255,584,809 | -0.04% | 126,530,505 | 127,360,907 | 0.66% |
| shootout-sieve | 844,263,240 | 844,508,149 | -0.03% | 67,092,099 | 67,628,053 | 0.80% |
| shootout-switch | 153,597,929 | 153,627,912 | -0.02% | 144,493,947 | 144,955,423 | 0.32% |
| shootout-xblabla20 | 4,924,967 | 4,926,304 | -0.03% | 96,463,543 | 96,976,115 | 0.53% |
| shootout-xchacha20 | 6,468,729 | 6,467,746 | 0.02% | 96,534,043 | 96,825,264 | 0.30% |
| spidermonkey | 742,879,491 | 744,941,257 | -0.28% | 23,687,766,691 | 23,749,945,287 | 0.26% |
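
As a quick sanity check of the two formulas given in the issue, the following minimal Python sketch recomputes the `shootout-keccak` row from the cycle counts in the table above. The function names `speedup` and `overhead` are illustrative only and are not part of `sightglass-cli`.

```python
def speedup(main_cycles, nodemorgan_cycles):
    """Execution speedup as defined in the issue: (main - nodemorgan) / nodemorgan."""
    return (main_cycles - nodemorgan_cycles) / nodemorgan_cycles


def overhead(main_cycles, nodemorgan_cycles):
    """Compilation overhead as defined in the issue: (nodemorgan - main) / main."""
    return (nodemorgan_cycles - main_cycles) / main_cycles


# Cycle counts copied from the shootout-keccak row of the table.
keccak_exec = speedup(48_797_823, 39_731_401)
keccak_comp = overhead(292_241_108, 253_254_413)

print(f"execution speedup:    {keccak_exec:.2%}")  # 22.82%
print(f"compilation overhead: {keccak_comp:.2%}")  # -13.34%
```

Note that the two metrics use different denominators: speedup is relative to the `no-demorgan` build, while overhead is relative to `main`.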
bongjunj edited issue #12086:
Hi, this is a follow-up from #cranelift > Deoptimizing ISLE rules?.
As mentioned in the Zulip topic above, I suspect these ISLE rules can degrade performance, since they can increase the number of instructions and thus the overall computation cost.

I evaluated the impact of the rules on the sightglass benchmarks.
Starting from the main branch of Cranelift, I removed the rules to create a `no-demorgan` version, and compared execution and compilation performance, measured in CPU cycles, using `sightglass-cli`.
The averages over 10 repetitions are presented in the table below.
My machine is x86-64 Linux with 64 cores and 512 GB of memory.

Removing the rules gives a 22.82% execution speedup and lowers compilation cost by 13.34% for `shootout-keccak`; the impact is negligible for the other benchmarks.

`speedup = (main - nodemorgan) / nodemorgan`
`overhead = (nodemorgan - main) / main`

The rules exist for normalization: they push `bnot` instructions down the tree so that other simplification rules and GVN can exploit the result further.
However, this data suggests the normalization is under-exploited and can degrade performance for `keccak`.

| Benchmark | Execution (main) | Execution (no-demorgan) | Speedup | Compilation (main) | Compilation (no-demorgan) | Overhead |
|---|---:|---:|---:|---:|---:|---:|
| blake3-scalar | 821,287 | 820,526 | 0.09% | 334,025,595 | 335,494,881 | 0.44% |
| blake3-simd | 904,391 | 902,968 | 0.16% | 215,000,060 | 216,887,637 | 0.88% |
| bz2 | 123,375,319 | 123,972,237 | -0.48% | 323,600,329 | 325,059,640 | 0.45% |
| pulldown-cmark | 7,527,276 | 7,528,312 | -0.01% | 685,357,632 | 687,241,047 | 0.27% |
| regex | 287,113,532 | 287,013,209 | 0.03% | 1,623,122,606 | 1,628,150,390 | 0.31% |
| shootout-ackermann | 7,766,207 | 7,769,915 | -0.05% | 98,442,831 | 99,049,461 | 0.62% |
| shootout-base64 | 377,876,186 | 377,986,725 | -0.03% | 94,119,875 | 94,616,896 | 0.53% |
| shootout-ctype | 796,212,604 | 796,195,661 | 0.00% | 90,769,795 | 90,728,785 | -0.05% |
| shootout-ed25519 | 11,062,252,786 | 11,041,529,973 | 0.19% | 505,160,708 | 511,230,333 | 1.20% |
| shootout-fib2 | 2,991,817,344 | 2,991,783,776 | 0.00% | 67,992,267 | 68,207,110 | 0.32% |
| shootout-gimli | 5,143,297 | 5,153,384 | -0.20% | 5,846,157 | 5,843,358 | -0.05% |
| shootout-heapsort | 2,374,978,997 | 2,375,615,158 | -0.03% | 29,560,353 | 29,690,677 | 0.44% |
| shootout-keccak | 48,797,823 | 39,731,401 | 22.82% | 292,241,108 | 253,254,413 | -13.34% |
| shootout-matrix | 697,653,531 | 697,060,503 | 0.09% | 93,389,415 | 93,786,432 | 0.43% |
| shootout-memmove | 37,572,864 | 37,679,507 | -0.28% | 95,438,341 | 95,867,972 | 0.45% |
| shootout-minicsv | 1,239,552,532 | 1,241,534,405 | -0.16% | 15,630,009 | 15,681,735 | 0.33% |
| shootout-nestedloop | 645 | 621 | 3.93% | 66,921,453 | 67,062,411 | 0.21% |
| shootout-random | 439,552,157 | 439,582,225 | -0.01% | 67,809,477 | 68,091,002 | 0.42% |
| shootout-ratelimit | 50,251,247 | 50,384,183 | -0.26% | 92,983,888 | 92,922,059 | -0.07% |
| shootout-seqhash | 15,249,759,981 | 15,255,584,809 | -0.04% | 126,530,505 | 127,360,907 | 0.66% |
| shootout-sieve | 844,263,240 | 844,508,149 | -0.03% | 67,092,099 | 67,628,053 | 0.80% |
| shootout-switch | 153,597,929 | 153,627,912 | -0.02% | 144,493,947 | 144,955,423 | 0.32% |
| shootout-xblabla20 | 4,924,967 | 4,926,304 | -0.03% | 96,463,543 | 96,976,115 | 0.53% |
| shootout-xchacha20 | 6,468,729 | 6,467,746 | 0.02% | 96,534,043 | 96,825,264 | 0.30% |
| spidermonkey | 742,879,491 | 744,941,257 | -0.28% | 23,687,766,691 | 23,749,945,287 | 0.26% |
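
To illustrate why pushing `bnot` down the tree can either add or remove instructions, here is a small Python sketch of De Morgan normalization on a toy expression tree. It is only an assumption-laden illustration: the tuple encoding and the helpers `push_bnot_down` and `count_ops` are hypothetical, this is not the actual ISLE rule set in Cranelift, and it counts tree nodes without modeling GVN or any other rewrite.

```python
def push_bnot_down(expr):
    """Push bnot toward the leaves using De Morgan's laws (toy model only).

    Expressions are either a variable name (str) or a tuple:
    ("bnot", x), ("band", x, y), ("bor", x, y).
    """
    if isinstance(expr, str):
        return expr
    op, *args = expr
    if op == "bnot":
        inner = args[0]
        if isinstance(inner, tuple):
            if inner[0] == "bnot":
                # Double negation cancels: ~~x == x
                return push_bnot_down(inner[1])
            if inner[0] in ("band", "bor"):
                # ~(x & y) == ~x | ~y   and   ~(x | y) == ~x & ~y
                dual = "bor" if inner[0] == "band" else "band"
                return (
                    dual,
                    push_bnot_down(("bnot", inner[1])),
                    push_bnot_down(("bnot", inner[2])),
                )
        return ("bnot", push_bnot_down(inner))
    return (op, *(push_bnot_down(a) for a in args))


def count_ops(expr):
    """Count operator nodes in the tree (a rough proxy for instruction count)."""
    if isinstance(expr, str):
        return 0
    return 1 + sum(count_ops(a) for a in expr[1:])


# Case where normalization costs an extra instruction: ~(a & b) -> ~a | ~b
worse = ("bnot", ("band", "a", "b"))
# Case where normalization pays off: ~(~a & ~b) -> a | b
better = ("bnot", ("band", ("bnot", "a"), ("bnot", "b")))

print(count_ops(worse), "->", count_ops(push_bnot_down(worse)))    # 2 -> 3
print(count_ops(better), "->", count_ops(push_bnot_down(better)))  # 4 -> 1
```

In this toy model, the rewrite is a net win only when the pushed-down `bnot` meets another `bnot` (or some other simplification) further down; otherwise it adds an operation, which is the kind of cost the issue suspects in `keccak`.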
cfallin commented on issue #12086:
Thanks for this analysis!
I am reading the data as: small but positive impact on most benchmarks, with one large negative outlier (`keccak`) as you mention. To me that suggests some adverse impact in an inner loop; perhaps we can fix it. Would you be able to dig a bit deeper on the benchmark (e.g. by profiling and comparing the hottest basic blocks before and after) to see if you can narrow down the cause?
bongjunj commented on issue #12086:
Thanks for the suggestion! Gonna run those analyses.
Last updated: Dec 06 2025 at 06:05 UTC