Ciavolino.Giuseppe
Associate II
July 27, 2017
Question

Issue with the execution time of NOP instruction [STM32F746G-DISCO]

Posted on July 27, 2017 at 02:34

Hello everyone,

Before starting a project that involves digital signal processing, I'm doing some tests with the STM32F746G-DISCO to evaluate the capabilities of the board.

In particular, I've measured (by toggling GPIOI_PIN_1) the execution time of a NOP: 60 ns.

I'm using CubeMX and I've set up the clock configuration to run at 216 MHz (the maximum frequency).

Also, in the 'Cortex_M7 Configuration' section I've enabled: TCM Interface, ART ACCELERATOR, Instruction Prefetch, CPU ICache and CPU DCache.

I'm a little bit upset, because 60 ns for a NOP contradicts the idea of a system core clock running at 216 MHz.
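
For scale, here is a rough sketch of the arithmetic in plain C (it runs on a PC; the function names are made up for illustration). At 216 MHz one core cycle lasts about 4.63 ns, so a 60 ns interval corresponds to roughly 13 cycles rather than one.

```c
#include <assert.h>

/* Period of one core clock cycle in nanoseconds, for a clock given in MHz. */
static double cycle_ns(double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return 1000.0 / clock_mhz;
}

/* Number of core cycles that fit into an interval measured in nanoseconds. */
static double ns_to_cycles(double interval_ns, double clock_mhz)
{
    return interval_ns / cycle_ns(clock_mhz);
}
```

Calling ns_to_cycles(60.0, 216.0) gives a little under 13, which hints that the measurement covers more than a single instruction.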

I'm sure I'm doing something wrong, but I really don't understand where. I've checked the RCC registers and their content is consistent with the code generated by CubeMX.

Could I be making a mistake somewhere? The related documents (datasheet, reference manual, programming manual, etc.) don't contain the information I'm looking for.

With pipelining, one machine cycle should correspond to one instruction... so why does this happen?

Sorry for the bad English, this is the first time I've posted on an international forum.

Thanks for the attention.  

 

#cortex-m #arm #stm32f7 #stm32-cube-mx #cycle-machine #nop #execution-time

7 replies

Danish1
Lead III
July 27, 2017
Posted on July 27, 2017 at 11:04

How did you measure the time of a NOP?

How sure are you that you are measuring just the time of the NOP and not all the overheads?

If you're doing something like (pseudo-code)

while (1) { GPIOI_PIN_1 = !GPIOI_PIN_1; NOP; }

Then the NOP is the _least_ of the things that take time.

You've got the overhead of the jump to make the loop. With a pipelined processor, this can be several cycles.

And (much more significantly) the overhead of reading the port, modifying the value and then writing back the value.

ST did a very good on-line course on stm32f7. I strongly recommend that you read the slides even if you don't actually get the hardware and follow it yourself.

Cortex M7 has dual-issue so it can execute two (non-interfering) instructions simultaneously.

I suppose what you could do is make your test loop:

while (1) {
    GPIOI_PIN_1 = !GPIOI_PIN_1; NOP; NOP; NOP; NOP; NOP; NOP; NOP; NOP; NOP; NOP; NOP; NOP;
}

And then see how the toggle HALF-PERIOD (not frequency) depends on the number of NOPs.

But watch out - an optimiser might realise that the NOPs do nothing and remove them. So do look at the code produced by your compiler before drawing any conclusions.
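
A minimal sketch of one way to keep the NOPs in the generated code, assuming a GCC- or Clang-style compiler: a volatile inline-assembly statement is not removed by the optimiser as long as it is reachable. The macro and function names are hypothetical, and the pin toggle is only marked by a comment so that the fragment also builds on a PC. On the target, the CMSIS `__NOP()` intrinsic serves the same purpose.

```c
#include <assert.h>

/* One NOP that the compiler keeps: a volatile asm statement is not
   optimised away (GCC/Clang extension; "nop" assembles on ARM and x86). */
#define NOP1()  __asm__ volatile ("nop")
#define NOP4()  do { NOP1(); NOP1(); NOP1(); NOP1(); } while (0)
#define NOP12() do { NOP4(); NOP4(); NOP4(); } while (0)

/* Runs `iterations` passes of the test loop body and returns how many
   passes were executed. */
static unsigned run_nop_loop(unsigned iterations)
{
    unsigned done = 0;
    assert(iterations > 0);
    while (done < iterations) {
        /* toggle the GPIO pin here on the real target */
        NOP12();
        done++;
    }
    return done;
}
```

This only stops the compiler from dropping the instructions. It says nothing about what the core does with them, so checking the disassembly is still worthwhile.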

Hope this helps,

Danish

Jan Waclawek
Visitor II
July 27, 2017
Posted on July 27, 2017 at 11:13

i've measured (toggling the GPIOI_PIN_1) the execution time of NOP

You've measured the execution time of NOP, plus the execution time of the instructions toggling the pin, plus whatever instructions were inserted by the C compiler (unless you used asm), plus the time needed to fetch the instructions, plus the time needed to propagate the toggle write from the processor through the bus matrix and the GPIO unit to the pin. I might have forgotten a few things.

It might quite well be that the NOP was thrown away in the prefetch unit, so its execution time was 0.

Welcome to the world of 32-bitters. These are not microcontrollers anymore, but rather SoCs.

JW

AvaTar
Senior III
July 27, 2017
Posted on July 27, 2017 at 16:59

Agree, it's not that simple.

I suggest measuring the toggling alone, then multiple NOPs (100 or 1000).

Finally subtract the toggle time.

In the Linux world, that measure is called BogoMIPS ...
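
The subtraction suggested here can be sketched as a small helper in plain C. The function name is hypothetical, and the two input times are whatever the oscilloscope shows for the bare toggle and for the toggle followed by n NOPs.

```c
#include <assert.h>

/* Average time per NOP in nanoseconds:
   (time with n NOPs - time of the bare toggle) / n. */
static double per_nop_ns(double with_nops_ns, double toggle_only_ns, unsigned n)
{
    assert(n > 0);
    return (with_nops_ns - toggle_only_ns) / (double)n;
}
```

Using a large n (100 or 1000) makes the fixed toggle and loop overhead small compared with the part being measured.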

AVI-crak
Senior
July 27, 2017
Posted on July 27, 2017 at 11:34

You need to write directly to the registers.

Place the code in SRAM, to avoid slow reads from the flash.

Use inline assembly to keep GCC from optimizing the code away.

Use a simple loop counting down from the maximum number to zero.

Use a large number of NOP instructions (10-50) in the body of the loop.

Use the DWT cycle counter to count the cycles.

Use the external MCO1 / MCO2 output pin to monitor the system frequency.

Kill the desire to use HAL, and begin to study the documentation.

Change the profession, country or sex in the kitchen.

Create your own processor, your own forum, and troll users.
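
As a sketch of the DWT suggestion in the list above, this is one way the cycle counter could be switched on. The register addresses are the ones defined by the ARMv7-M architecture. The struct of pointers and the function name are hypothetical, introduced only so the same code can be dry-run against plain variables on a PC. The unlock write is an assumption that is commonly reported as necessary on Cortex-M7 parts.

```c
#include <assert.h>
#include <stdint.h>

/* Architecturally defined ARMv7-M addresses (only meaningful on the target). */
#define DEMCR_ADDR       0xE000EDFCu  /* Debug Exception and Monitor Control */
#define DWT_CTRL_ADDR    0xE0001000u  /* DWT control register */
#define DWT_CYCCNT_ADDR  0xE0001004u  /* DWT cycle counter */
#define DWT_LAR_ADDR     0xE0001FB0u  /* DWT lock access register */

/* Hypothetical bundle of register pointers, so the code can be pointed at
   real hardware or at ordinary variables for a dry run. */
typedef struct {
    volatile uint32_t *demcr;
    volatile uint32_t *lar;
    volatile uint32_t *ctrl;
    volatile uint32_t *cyccnt;
} cyc_regs_t;

static void cyc_counter_start(const cyc_regs_t *r)
{
    assert(r && r->demcr && r->lar && r->ctrl && r->cyccnt);
    *r->demcr |= (1u << 24);   /* TRCENA: enable the DWT unit */
    *r->lar    = 0xC5ACCE55u;  /* unlock; assumed to be needed on some M7 parts */
    *r->cyccnt = 0u;           /* start counting from zero */
    *r->ctrl  |= 1u;           /* CYCCNTENA: let CYCCNT run */
}
```

On the target the pointers would be the addresses above cast to `volatile uint32_t *`. With the CMSIS headers the equivalent is `CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; DWT->CYCCNT = 0; DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;`.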
David SIORPAES
ST Employee
July 28, 2017
Posted on July 28, 2017 at 10:58

Try using SEV instruction to emit a pulse instead of toggling GPIOs.

Wrapping 10 NOPs with SEVs on an STM32F401 clocked at 84 MHz gives a result consistent with what is expected:

0690X00000607iYQAQ.png
AvaTar
Senior III
July 28, 2017
Posted on July 28, 2017 at 11:52

Honestly, I don't understand the idea behind this.

For a synchronous MCU design with given clock and an instruction in the pipeline, you get - surprise, surprise - the execution time stated in the datasheet.

Testing an instruction sequence under realistic conditions (clock, Flash latencies, caches, interrupt latencies, DMA bus load, etc.) gives you more (and useful) information.

David SIORPAES
ST Employee
July 28, 2017
Posted on July 28, 2017 at 12:24

As far as I understood, the OP is surprised by the NOP execution time he measured (60 ns).

I was just suggesting a better method to successfully accomplish what he had in mind, i.e. measuring the execution time of a NOP instruction.
Ciavolino.Giuseppe
Associate II
July 28, 2017
Posted on July 28, 2017 at 13:30

I apologize for not having answered yet, but as you have guessed, I don't have great skills in the embedded field and I'm trying to reconcile your tips with what I know.

Also, my English is poor and I want to avoid saying stupid things. Unfortunately I don't have an oscilloscope at home and can only use the university's, so I can't run specific tests right now, but on Monday I will post my results.

I would like to use the STM32F746G-DISCO to acquire environmental noise, using the WM8994 codec for the ADC/DAC part and the MCU for the data processing.

I've implemented the communication between the MCU and the WM8994 (I2C for the settings and SAI for the data) and now I would like to test its potential. I'm using DMA in circular mode to exchange data between the codec and the microcontroller: every time a sample is received the DMA triggers an interrupt routine, and during this routine I will do some processing.

So, before working on the algorithm, I would like to know whether the board has enough capacity, and for this reason I tried to measure the NOP.

Thank you so much for the support, and sorry for any stupid things I may say.
Ciavolino.Giuseppe
Associate II
July 31, 2017
Posted on July 31, 2017 at 17:16

As I said in the previous post, today I could do some tests.

Instead of measuring the NOP, I preferred to measure the execution time of a math operation: in this case the division.

I found AN4044, "Floating Point Unit demonstration on STM32 microcontrollers", which reports the number of machine cycles associated with each math operation:

0690X00000607e9QAA.png

So I wrote this code:

0690X00000607lDQAQ.png

I've checked in the debugger that the assembly code is consistent with the FPU instruction:

0690X00000607lJQAQ.png

The time for 10 division operations is 1.08 us, so dividing this interval by 10, the time for a single division is 108 ns.

Dividing 108 ns by 20 (14 cycles + 6 cycles), I get the time of one cycle, which is 5.4 ns.

0690X00000602UDQAY.bmp

Is this consistent with a system core clock of 216 MHz?
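
As a cross-check of the figures above, here is a small sketch in plain C (the function name is made up for illustration) that turns the measurement into the clock frequency it would imply. With 1.08 us for 10 divisions at an assumed 20 cycles each, the implied clock is about 185 MHz, while 216 MHz would mean 4.63 ns per cycle. Put the other way, 108 ns at 216 MHz is about 23 cycles per division, a few more than the 20 assumed.

```c
#include <assert.h>

/* Clock frequency in MHz implied by `ops` operations of `cycles_per_op`
   cycles each taking `total_ns` nanoseconds in total. */
static double implied_clock_mhz(double total_ns, unsigned ops, unsigned cycles_per_op)
{
    assert(total_ns > 0.0);
    return 1000.0 * (double)ops * (double)cycles_per_op / total_ns;
}
```

The gap does not by itself show that the clock is misconfigured. A few extra cycles per iteration from surrounding loads, stores or pipeline effects would produce the same numbers.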

Tesla DeLorean
Guru
July 31, 2017
Posted on July 31, 2017 at 18:21

Here's a suggestion: use DWT_CYCCNT to count cycles.

Stop using C, and replicate the VDIV.F32 s2,s0,s1 10x or 100x in assembler. This will show the execution time of the instruction itself, not the pairing or pipeline stalls that other sequences might introduce.
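
When two DWT_CYCCNT snapshots are compared, the counter is 32 bits wide and wraps around. A minimal sketch of the bookkeeping, with hypothetical helper names: unsigned subtraction handles a single wrap, and the second helper estimates how long the counter runs before wrapping at a given clock.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles elapsed between two CYCCNT snapshots. Unsigned subtraction is
   modulo 2^32, so one counter wrap between the snapshots is harmless. */
static uint32_t cyc_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Seconds until a 32-bit cycle counter wraps at the given core clock. */
static double cyc_wrap_seconds(double clock_hz)
{
    assert(clock_hz > 0.0);
    return 4294967296.0 / clock_hz;
}
```

At 216 MHz the wrap period is just under 20 seconds, far longer than a run of 10 or 100 instructions.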

Is 14 cycles the maximum? Could certain data foreshorten this?

Multiplication by a reciprocal could get you to 1 cycle for this division.
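
A minimal sketch of the reciprocal idea in C, assuming the same divisor is applied to many samples. The function name is hypothetical. The single division is done once outside the loop and every sample then costs only a multiplication.

```c
#include <assert.h>
#include <stddef.h>

/* Divides every element of buf by the same divisor using one division
   (for the reciprocal) and n multiplications. */
static void scale_by_reciprocal(float *buf, size_t n, float divisor)
{
    assert(divisor != 0.0f);
    const float inv = 1.0f / divisor;   /* the only division */
    for (size_t i = 0; i < n; i++) {
        buf[i] *= inv;                  /* multiply instead of divide */
    }
}
```

The result can differ from a true division in the last bit, because the reciprocal itself is rounded. Whether that matters depends on the algorithm.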

A compiler paying attention could fold this code.

STOne-32
Technical Moderator
July 31, 2017
Posted on July 31, 2017 at 22:10

Dear all,

NOP on all Cortex-M CPUs is not intended by design, unlike on legacy ARM7/ARM9 cores, to be used for timing or counting cycles, but for padding and aligning data or code. Instead you can use, for example, {MOV R0, R0}. See this article on the ARM web site:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0552a/CHDJJGFB.html

Enjoy the reading,

Cheers

STOne -32