Skip to main content
Piranha
Principal III
September 28, 2019
Solved

[BUG] STM32 lwIP Ethernet driver Rx deadlock

  • September 28, 2019
  • 4 replies
  • 8588 views

This bug is present in F1, F2, F4 and F7 series examples and CubeMX generated code with RTOS and is one of the biggest flaws in ST's lwIP integration.

Problem

https://github.com/STMicroelectronics/STM32CubeF7/blob/3600603267ebc7da619f50542e99bbdfd7e35f4a/Projects/STM32F767ZI-Nucleo/Applications/LwIP/LwIP_HTTP_Server_Netconn_RTOS/Src/ethernetif.c#L376

lwIP core (also known as the "tcpip_thread") calls low_level_output(), which calls HAL_ETH_TransmitFrame(). While the latter is processing, Ethernet input thread (ethernetif_input() function) can call low_level_input(), which calls HAL_ETH_GetReceivedFrame_IT(). Because both of those HAL functions use HAL's ingeniously stupid "lock" mechanism, HAL_ETH_GetReceivedFrame_IT() returns HAL_BUSY. Subsequently that makes low_level_input() to return NULL and ethernetif_input() to go back on waiting for a semaphore.

Consequences

  • Received frames are not processed and Rx buffers are not released to DMA.
  • When the next frames are received, code will iterate and try to process all previous frames up to the current frame. But again, if at some iteration HAL_BUSY is returned, the rest of the frames will be left unprocessed.
  • If no more frames do come from network, then the frames waiting in Rx buffers will never be processed.
  • If all Rx descriptors have been used (OWN bit cleared) when semaphore is acquired and HAL_BUSY is returned on processing the first frame, no more Rx complete interrupts will be generated. Consequently semaphore will not be released and Ethernet input thread will be stuck forever on waiting for that semaphore.

Solution

This code:

if(HAL_ETH_GetReceivedFrame_IT(&EthHandle) != HAL_OK)
	return NULL;

Must be replaced with this code:

HAL_StatusTypeDef status;
 
LOCK_TCPIP_CORE();
status = HAL_ETH_GetReceivedFrame_IT(&EthHandle);
UNLOCK_TCPIP_CORE();
 
if (status != HAL_OK) {
	return NULL;
}

This topic has been closed for replies.
Best answer by Piranha

The good news is that ST has fixed this bug in CubeMX v5.5.0. The bad news is that it's done in the same sub-optimal way the H7 series already had. They've not followed my suggestion, but have put the lwIP core locking in ethernetif_input() function:

do
 { 
 LOCK_TCPIP_CORE();
 p = low_level_input( netif );
 if (p != NULL)
 {
 if (netif->input( p, netif) != ERR_OK )
 {
 pbuf_free(p);
 }
 }
 UNLOCK_TCPIP_CORE();
 } while(p!=NULL);

As ST do not read documentation (Multithreading) and do not analyse code, they don't know that both of these..

  1. pbuf_free()
  2. netif->input() => tcpip_input() => tcpip_inpkt()

..are thread safe. Therefore the code locks lwIP core for a longer time than it's necessary and is sub-optimal. Poor fix, but at least it fixes the deadlock issue!

P.S. The examples are still flawed regarding this!

4 replies

Amel NASRI
Technical Moderator
October 4, 2019

Hi @Piranha​ ,

Thanks for this detailed explanation and the suggested solution.

This is reported internally for farther review by development teams.

-Amel

To give better visibility on the answered topics, please click on Accept as Solution on the reply which solved your issue or answered your question.
Piranha
PiranhaAuthor
Principal III
October 9, 2019

Hi, @Amel NASRI​ !

That's good to hear!

https://community.st.com/s/question/0D50X0000BOtfhnSQB

Please reopen that topic. The particular topic's purpose is not for ST employees, but for users of this forum and everyone on internet. It's not a bug-report in itself, but a collection of all Ethernet/lwIP related critical problems in one place. Therefore it's like a how-to to which a single URL can be given to most of the hundreds of "Ethernet/lwIP not working on STM32" voices out there. Two items is just the beginning... I'm already preparing at least 5 more bug reports on this and probably there will be even more!

ST has failed to provide a working Ethernet/lwIP solution since releasing STM32F107 in year 2008. Since that time internet has been filled with ST's flawed code, which people are and will use even if ST fixes current drivers, examples and CubeMX generated code. If all of these network related problems will not be collected in a single place, 98% of people (which by the way applies to ST developers also) will never be able to fix those problems. So, please, at least don't suppress others in doing ST's job.

Amel NASRI
Technical Moderator
October 9, 2019

OK, I'll re-open it.

To give better visibility on the answered topics, please click on Accept as Solution on the reply which solved your issue or answered your question.
Piranha
PiranhaAuthorAnswer
Principal III
February 1, 2020

The good news is that ST has fixed this bug in CubeMX v5.5.0. The bad news is that it's done in the same sub-optimal way the H7 series already had. They've not followed my suggestion, but have put the lwIP core locking in ethernetif_input() function:

do
 { 
 LOCK_TCPIP_CORE();
 p = low_level_input( netif );
 if (p != NULL)
 {
 if (netif->input( p, netif) != ERR_OK )
 {
 pbuf_free(p);
 }
 }
 UNLOCK_TCPIP_CORE();
 } while(p!=NULL);

As ST do not read documentation (Multithreading) and do not analyse code, they don't know that both of these..

  1. pbuf_free()
  2. netif->input() => tcpip_input() => tcpip_inpkt()

..are thread safe. Therefore the code locks lwIP core for a longer time than it's necessary and is sub-optimal. Poor fix, but at least it fixes the deadlock issue!

P.S. The examples are still flawed regarding this!

iw2lsi
Associate III
March 11, 2020

Hi...

it could be that this particular bug has been fixed, but as far we can see, CubeMX is still doing weird things on a STM32F767 Nucleo board (i.e. by setting PHY_ADDRESS=1 in some circustances) and even a simple ICMP (ping) still hangs after some time (and we are talking of minutes, not days or weeks)...

I wonder how can it be that ST is not paying attention on this and is not focusing on such a crucial stack as LWIP...

Best Regards,

Giampaolo

Nettok
Visitor II
April 23, 2020

Indeed, I was already using the new CubeMX version with the fix, and ICMP (ping) still hangs after some time.

I made a couple of changes which I outline in a comment in this thread: https://community.st.com/s/question/0D50X0000BMErkFSQT/stm32f407-lwip-ip-stack-stop-working-during-tcp-port-scanning-from-vulnerabilities-test-tool

And it improved things, but I let it running a test overnight and it hanged again after a couple of hours.

MHama.3
Associate II
May 24, 2022

@Piranha​ first of all thanks for your efforts and your probosed solution

  1. I tried to do like you did but the LOCK_TCPIP_CORE(); is not defined in my program
  2. HAL_StatusTypeDef status;
  3.  
  4. LOCK_TCPIP_CORE();
  5. status = HAL_ETH_GetReceivedFrame_IT(&EthHandle);
  6. UNLOCK_TCPIP_CORE();
  7.  
  8. if (status != HAL_OK) {
  9. return NULL;
  10. }

I use Lwip without Rtos and I am facing the same problem you had, I can ping my board if no Traffic is there

Where can I finde the definition of the function LOCK_TCPIP_CORE();

Best regards

Mosaab

Piranha
PiranhaAuthor
Principal III
May 24, 2022

You don't need those macros at all, because the lwIP core locking is only necessary with RTOS.

MHama.3
Associate II
May 25, 2022

do you have any Idea, what to do in my case. Because i tried every thing but the board is still not responding, busy with all packets in the network

PHolt.1
Senior
July 23, 2022

I posted this before but it got no response.

I think the mutex macro LOCK_TCPIP_CORE() is not right, because if you then try to make LWIP thread-safe, by setting LWIP_TCPIP_CORE_LOCKING=1, you activate the same set of mutexes at the LWIP API, which obviously bombs the whole thing because your API call (say to open a socket) sets the API mutex and then the same mutex is attempted to be set again by the above low level code.

Unless you can somehow be sure that your two low level functions cannot be re-entered (and I can't see how you can be sure if you are calling the LWIP API from multiple threads) then you have to mutex them yourself.

But you have to use a different set of mutexes. That's what I did. I just pulled two more out of FreeRTOS and used them.

The trouble comes if you are calling either function under an interrupt. Then you cannot mutex them.

Transmission does not need interrupts because a call to LWIP to send some data to XYZ eventually results in packets being sent out on ETH, so using the ISR for that adds no value that I can see.

Reception does benefit from interrupts, compared to polling, greatly. I am not sure how to handle this though if you have a multi threaded LWIP. I posted what I did here

https://www.eevblog.com/forum/microcontrollers/anyone-here-familiar-with-lwip/

but I am still doing polled RX.

Piranha
PiranhaAuthor
Principal III
July 24, 2022

The code this topic is talking about is located in low_level_input() function, which is called only from the Ethernet input thread (ethernetif_input() function). And the low_level_output() function is called only either from the core thread ("tcpip_thread") or from other threads with the core (thread) locked. Therefore, if the other code is not broken, neither of these two functions can be re-entered.

Take a note that this whole "solution" was only a sub-optimal workaround to make ST's lwIP implementation working at all, while not changing the HAL ETH driver. The real solution is to remove the idiotic and broken HAL "lock" mechanism from the driver. When that is done, there is no reason to use the workaround this topic is about. And the new rewritten zero-copy driver doesn't have HAL "lock" in it anymore. If you still want to waste your time on fixing a code created by incompetent fools, I recommend taking their new rewritten driver and lwIP integration as the base and fixing it's flaws. It will require less work and it will be more compatible with their future code.

In zero-copy drivers the transmission benefits from interrupt on completion of the frame transmission - at that point in time the driver can free the buffer. And, if SYS_LIGHTWEIGHT_PROT=1 and LWIP_ALLOW_MEM_FREE_FROM_OTHER_CONTEXT=1, pbuf_free() can even be called directly from an ISR, improving the overall performance. Reception is easy - signal the event from the ISR and process the actual frames in the Ethernet input processing thread. Now with the rewritten driver ST's ethernetif_input() implementation is correct. It's just recommended to use thread events/flags/notifications instead of semaphore for better performance and less resource usage.

I'll read your topics on EEVblog and comment later... :smiling_face_with_smiling_eyes:

ktrofimo
Senior III
September 16, 2022

Working with the latest H7 HAL code still found some mutex lockup in low_level_output() after HAL_ETH_Transmit_IT() :

while( osSemaphoreAcquire( TxPktSemaphore, TIME_WAITING_FOR_INPUT) != osOK )
{
}

Looks like your solution with locking low_level_input() is working in this case too...