yonatan
Associate III
May 4, 2025
Solved

STM32H723: How to optimize summation of an array?


Hi folks.

I am trying to optimize the following piece of code for execution time.

	for (uint32_t i = 6 + adc_data_index; i < 35 + adc_data_index; i++)
	{
		raw[0] += (adc_data[i]);
		raw[1] += (adc_data[i + 35]);
		raw[2] += (adc_data[i + 70]);
		raw[3] += (adc_data[i + 115]);
	}

Currently it takes 3.5 microseconds at a 250 MHz clock.

I want to reduce that by at least a factor of 2.

Do you have any ideas?

What I tried:

1. Change the optimization to be -Ofast

2. Using a pointer

3. I also thought about the FMAC and DFSDM peripherals

How can I achieve that?

Thanks

Yonatan

7 replies

mbarg.1
Senior III
May 4, 2025

Suggestion: avoid computations inside the loop, e.g. by replacing i with an array (pointer) set up before running the loop, and run the loop from a down to zero to simplify the end-of-loop check.
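One possible reading of this suggestion, as a minimal sketch rather than the poster's actual code: resolve the four block start addresses once, then count down to zero. The function name `sum4_countdown` and its parameters are hypothetical, the element type `uint16_t` is an assumption, and the offsets 35, 70 and 115 are taken from the loop in the question.

```c
#include <stdint.h>

/* Sketch only: assumes adc_data holds uint16_t samples and that the four
 * blocks start at offsets 0, 35, 70 and 115 from base, as in the question. */
void sum4_countdown(const uint16_t *base, uint32_t count, uint32_t raw[4])
{
    /* Resolve the four block start addresses once, outside the loop. */
    const uint16_t *p0 = base;
    const uint16_t *p1 = base + 35;
    const uint16_t *p2 = base + 70;
    const uint16_t *p3 = base + 115;

    /* Count down to zero so the loop test is a plain compare against 0. */
    for (uint32_t n = count; n != 0; n--) {
        raw[0] += *p0++;
        raw[1] += *p1++;
        raw[2] += *p2++;
        raw[3] += *p3++;
    }
}
```

For the loop in the question this would be called as `sum4_countdown(&adc_data[6 + adc_data_index], 29, raw);`. Whether it is actually faster depends on what the compiler already does, so the disassembly is worth checking.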

 

 

yonatan (Author)
Associate III
May 4, 2025

Thanks @mbarg.1 

What do you mean by "replacing i with an array"?

 

TDK (Answer)
Super User
May 4, 2025

> 3.5 micro-second at 250 MHz clock

So 875 cycles (3.5 µs × 250 MHz) for 116 (4*29) summations, or roughly 7.5 cycles per addition. There is probably some improvement to be made.

 

Storing raw and adc_data in DTCMRAM will help.

Enabling data cache if not already enabled will help a lot.

Executing the function out of ITCMRAM will also help.

 

Looking at the disassembly will be the most useful step here: it shows what the compiler is doing and what is unnecessary, which can guide you to the right solution. I imagine using a pointer for access and comparing it against a constant end pointer, rather than comparing the loop variable to 35 + X, will help a bit.
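The pointer idea could look roughly like the sketch below. The function name `sum4_endptr` and its parameters are hypothetical, `uint16_t` samples are assumed, and the offsets 35, 70 and 115 come from the question. Accumulating into local variables instead of `raw[]` is an extra assumption of this sketch, not something stated in the post; it gives the compiler a chance to keep the sums in registers.

```c
#include <stdint.h>

/* Sketch only: one moving pointer compared against an end pointer that is
 * computed once. Assumes uint16_t samples and block offsets 35/70/115. */
void sum4_endptr(const uint16_t *base, uint32_t count, uint32_t raw[4])
{
    const uint16_t *p = base;
    const uint16_t *const end = base + count; /* loop bound, computed once */
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;  /* local accumulators */

    while (p != end) {
        s0 += p[0];
        s1 += p[35];
        s2 += p[70];
        s3 += p[115];
        p++;
    }

    /* Write back once, keeping the += semantics of the original loop. */
    raw[0] += s0;
    raw[1] += s1;
    raw[2] += s2;
    raw[3] += s3;
}
```

A call equivalent to the original loop would be `sum4_endptr(&adc_data[6 + adc_data_index], 29, raw);`.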

yonatan (Author)
Associate III
May 5, 2025

Thanks.

1. Does enabling the I/D cache have any downsides?

2. Should I protect the adc_data buffer with the MPU? Is this mandatory?

3. Does placing the adc_data in the DTCM eliminate the need to use the MPU (Is DTCM always protected from cache issues?)

mbarg.1
Senior III
May 5, 2025

Cache will speed up execution, BUT you must manage it. It is up to you to decide whether the extra load and complexity are worth it.

Protecting the data is application dependent. ADC samples are typically primitives (e.g. uint16_t), so a single value cannot be invalid, but you may need the whole set to be consistent before processing. Again, that is up to you to decide.
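As a rough illustration of what "managing" the cache can involve: if the DMA buffer sits in cacheable RAM, one common approach is to invalidate the buffer's address range before the CPU reads fresh samples. Cache maintenance on the Cortex-M7 works on whole 32-byte lines, so the range is usually rounded out to line boundaries. The helper `cache_line_span` and the struct `cache_span_t` below are hypothetical names; in firmware the result would typically be passed to the CMSIS function `SCB_InvalidateDCache_by_Addr`. Data in DTCM is not cached, so none of this applies there.

```c
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u /* Cortex-M7 data cache line size in bytes */

typedef struct {
    uintptr_t start; /* address rounded down to a cache line boundary */
    uint32_t  size;  /* byte count covering whole cache lines */
} cache_span_t;

/* Round [addr, addr + len) out to whole cache lines. */
cache_span_t cache_line_span(uintptr_t addr, uint32_t len)
{
    const uintptr_t mask = ~(uintptr_t)(DCACHE_LINE_SIZE - 1u);
    cache_span_t span;

    span.start = addr & mask;
    uintptr_t end_up = (addr + len + DCACHE_LINE_SIZE - 1u) & mask;
    span.size = (uint32_t)(end_up - span.start);
    return span;
}
```

Invalidating a line that is shared with unrelated variables can discard their pending writes, which is why DMA buffers are often aligned to 32 bytes and padded to a multiple of 32 bytes.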

 

LCE
Principal II
May 5, 2025

A mix of all of the above might help - although I'm afraid of caches... :D

But you probably use the ADC with DMA, so the ADC buffer cannot be placed in DTCM (the regular DMA controllers cannot access it).

So I would try:

uint16_t *pu16Adat0 = &adc_data[adc_data_index + 6 + 0]; // pointer type must be same as adc_data!
uint16_t *pu16Adat1 = &adc_data[adc_data_index + 6 + 35];
uint16_t *pu16Adat2 = &adc_data[adc_data_index + 6 + 70];
uint16_t *pu16Adat3 = &adc_data[adc_data_index + 6 + 105]; // or is it really "115" ?

for( uint32_t i = 0; i < 29; i++ )
{
 raw[0] += pu16Adat0[i];
 raw[1] += pu16Adat1[i];
 raw[2] += pu16Adat2[i];
 raw[3] += pu16Adat3[i];
}

 

It would be interesting to see if incrementing the pointers instead of indexing speeds things up, like

raw[0] += *(pu16Adat0++);

yonatan (Author)
Associate III
May 5, 2025

Thanks!

It saved me ~250 ns.

I am counting every clock.

LCE
Principal II
May 5, 2025

> It saved me ~250 ns

Oh my, that's disappointing... :(

LCE
Principal II
May 5, 2025

Maybe it helps if you place at least the iteration variable i and the destination buffer raw[] into DTCM.

And / or using data cache might help.

LCE
Principal II
May 6, 2025

Thanks for coming back with the working code!

How's the timing with the function in ITCM RAM?

yonatan (Author)
Associate III
May 6, 2025

It saved me something like another ~500 ns.

(BTW, I also enabled the ICache so the "profit" is small)

waclawek.jan
Super User
May 6, 2025

Was explicit loop unrolling already mentioned?

uint32_t* p = &adc_data[6 + adc_data_index]; // pointer type must match adc_data's element type
// 29 terms per sum (p[0]..p[28]), matching i = 6..34 in the original loop
raw[0] = p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7] + 
 p[8] + p[9] + p[10] + p[11] + p[12] + p[13] + p[14] + p[15] + 
 p[16] + p[17] + p[18] + p[19] + p[20] + p[21] + p[22] + p[23] + 
 p[24] + p[25] + p[26] + p[27] + p[28];
p += 35;
raw[1] = p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7] + 
 p[8] + p[9] + p[10] + p[11] + p[12] + p[13] + p[14] + p[15] + 
 p[16] + p[17] + p[18] + p[19] + p[20] + p[21] + p[22] + p[23] + 
 p[24] + p[25] + p[26] + p[27] + p[28];
p += 35;
raw[2] = p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7] + 
 p[8] + p[9] + p[10] + p[11] + p[12] + p[13] + p[14] + p[15] + 
 p[16] + p[17] + p[18] + p[19] + p[20] + p[21] + p[22] + p[23] + 
 p[24] + p[25] + p[26] + p[27] + p[28];
p += 35; // gives offset 105; the original loop uses 115 for the fourth block, so this step may need to be 45
raw[3] = p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7] + 
 p[8] + p[9] + p[10] + p[11] + p[12] + p[13] + p[14] + p[15] + 
 p[16] + p[17] + p[18] + p[19] + p[20] + p[21] + p[22] + p[23] + 
 p[24] + p[25] + p[26] + p[27] + p[28];

Inspect the disassembly of the resulting code; you should see a repeating pattern of load/add instructions.

JW

 

PS. You can MDMA into DTCM. The idea is to gather data from peripherals into SRAM using the "normal" DMA, and then the DMA's transfer complete would trigger MDMA which would in turn move all that data to DTCM for the processor to process further.

yonatan (Author)
Associate III
May 6, 2025

Hi @waclawek.jan 

You are right, but the problem is that '6' and '35' are not known at compile time.

They are initialized to their values at run time.

Regarding the MDMA...

In general it is possible, but the DMA runs in circular mode, so I am afraid of missing some signals (interrupts etc.).

waclawek.jan
Super User
May 6, 2025

> the problem is that '6' and '35' is not known at compilation time

That makes things more complicated but not hopeless.

If there's a limited number of '6' and '35' variants, you can have a separate function for each combination (i.e. you compile many functions), and then choose whichever is appropriate at run time.
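A minimal sketch of the "one function per variant" idea, under assumptions: the macro `DEFINE_SUM_N`, the generated `sum_28`/`sum_29`/`sum_30` functions and the selector `select_sum` are hypothetical names, the set of lengths is made up for illustration, and `uint16_t` samples are assumed. With a compile-time constant trip count the compiler is free to unroll each variant fully, though whether it does so has to be checked in the disassembly.

```c
#include <stdint.h>
#include <stddef.h>

/* Generate one summation function per fixed length N. */
#define DEFINE_SUM_N(N)                                  \
    static uint32_t sum_##N(const uint16_t *p)           \
    {                                                    \
        uint32_t s = 0;                                  \
        for (uint32_t i = 0; i < (N); i++) {             \
            s += p[i];                                   \
        }                                                \
        return s;                                        \
    }

/* Hypothetical set of lengths that can occur at run time. */
DEFINE_SUM_N(28)
DEFINE_SUM_N(29)
DEFINE_SUM_N(30)

typedef uint32_t (*sum_fn_t)(const uint16_t *);

/* Pick the matching variant once, e.g. when the run-time values are set. */
sum_fn_t select_sum(uint32_t count)
{
    switch (count) {
    case 28: return sum_28;
    case 29: return sum_29;
    case 30: return sum_30;
    default: return NULL; /* length not covered by a precompiled variant */
    }
}
```

The selection would be done once when the run-time values become known, and the returned function pointer is then called for each of the four blocks.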

If there are more variants than manageable reasonably, you can use "calculated jumps" amidst the series of additions. Switch/case may accomplish this, but it needs to be checked whether compiler actually compiles it reasonably.

nr = var35 - var6;
p = &adc_data[var6];
sum = 0;
switch(nr) {
 case 29: sum += *p++; // note the intentional fallthrough 
 case 28: sum += *p++;
 case 27: sum += *p++;
 [etc.]
}
p += whatever_remains;

One may also want to resort to asm here, inline or not, if C does not provide enough control over the resulting code. I'm not sure whether any compiler recognizes the pattern and actually calculates the jump; most of them should at least use the table-jump instruction (TBB/TBH), but some may be stubborn and generate a chain of compare-and-branch instructions, which is useless here.

A partial unroll together with a calculated jump can also be used as a simpler, slightly slower version. This combination is known as Duff's device.
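For reference, a Duff's-device style summation might look like the sketch below. The function name `sum_duff` is hypothetical, `uint16_t` samples are assumed, and the unroll factor of 4 is an arbitrary choice for illustration. The `switch` jumps into the middle of the unrolled loop body to handle the remainder first.

```c
#include <stdint.h>

/* Sketch only: 4x partial unroll with a calculated jump into the loop body. */
uint32_t sum_duff(const uint16_t *p, uint32_t count)
{
    uint32_t sum = 0;

    if (count == 0) {
        return 0;
    }

    uint32_t passes = (count + 3u) / 4u; /* loop passes, incl. the partial one */

    switch (count % 4u) {
    case 0: do { sum += *p++; /* fall through */
    case 3:      sum += *p++; /* fall through */
    case 2:      sum += *p++; /* fall through */
    case 1:      sum += *p++;
            } while (--passes > 0);
    }
    return sum;
}
```

With count = 29 the switch enters at `case 1`, performs one addition, and then runs seven full passes of four additions each.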

Another option is to generate the code into RAM in runtime, or use self-modifying code (which may be as simple as inserting at the appropriate place in a sequence of additions a jump out of the sequence).

>> MDMA
> In general it is possible but the DMA action is in circular buffer so I am afraid of missing some signals (interrupts etc.)

I don't see why anything would get missed here, but I also don't know your whole application.

JW

LCE
Principal II
May 6, 2025

MDMA:

If you're afraid of losing data, you could also use the DMA's half-transfer-complete interrupt and then trigger the MDMA for the first half of the buffer.

And / or the DMA's double buffer mode (DBM).
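A rough sketch of the half-buffer bookkeeping, independent of any particular driver: the names `on_dma_half_complete`, `on_dma_full_complete`, `take_ready_half` and the buffer length `ADC_BUF_LEN` are all hypothetical. In an STM32 HAL project the two notification functions would typically be called from the ADC half-complete and complete callbacks; that wiring is an assumption and is not shown here.

```c
#include <stdint.h>
#include <stddef.h>

#define ADC_BUF_LEN 300u /* hypothetical total length of the circular buffer */

typedef enum { HALF_NONE = 0, HALF_FIRST, HALF_SECOND } half_t;

/* Set from interrupt context, consumed from the main loop. */
static volatile half_t ready_half = HALF_NONE;

/* Call when the DMA has filled the first half of the buffer. */
void on_dma_half_complete(void) { ready_half = HALF_FIRST; }

/* Call when the DMA has filled the second half and wraps around. */
void on_dma_full_complete(void) { ready_half = HALF_SECOND; }

/* Returns the half that is safe to process, or NULL if nothing is pending.
 * Note: read-then-clear is not atomic; real firmware may need to mask the
 * interrupt around it or use a counter to detect overruns. */
const uint16_t *take_ready_half(const uint16_t *buf)
{
    half_t h = ready_half;
    ready_half = HALF_NONE;

    if (h == HALF_FIRST) {
        return buf;
    }
    if (h == HALF_SECOND) {
        return buf + ADC_BUF_LEN / 2u;
    }
    return NULL;
}
```

The half that is returned is the one the DMA has just finished writing, so the CPU (or an MDMA copy) can work on it while the DMA fills the other half.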