Problem Statement

The main mlme processing kernel thread returns from wait_for_completion() immediately, without waiting for complete() to be called on the completion variable. As a result, the mlme response was erroneous, and the command used to check the wifi link status, “sudo iw wlan0 link”, returned an error.

Findings

Normal working routine of completion variable

By design, the main work queue reads the mlme messages from the shared queue and calls wifi_message_handler() to process each one. wifi_message_handler() sends the message over the netlink interface to the user space mlme parser and then waits for the parser to respond with an mlme response packet. The waiting is done with a completion variable: wifi_message_handler() calls wait_for_completion(). In the usual path, the kernel invokes the registered netlink callback as soon as the user space parser writes something into the netlink interface. The callback performs its own housekeeping and then calls complete() on the completion variable, which unblocks the main mlme processing work queue. This works fine until the corner case arises.

Corner Case Scenario

Some mlme messages, such as MlmeAddScan_request, expect three separate responses: MlmeAddScan_confirm, MlmeScan_indication and MlmeScan_done. These responses cannot be combined, because the host driver architecture expects them separately and will time out if any of them is delayed. So the work queue waits only for the first response, MlmeAddScan_confirm. When this message arrives, the kernel netlink callback calls complete() on the completion variable and unblocks the waiting work queue thread. Things are fine up to this point.

The problem starts when the next responses, MlmeScan_indication and MlmeScan_done, come in. No one is waiting on the completion variable, and the question is how to determine whether or not to call complete() on it. My initial understanding from the kernel documentation was that completion_done() can be used in this situation to find out whether anyone is waiting on the completion variable.

We have to understand what completion_done() actually returns. It returns FALSE if no completion has been submitted on the completion variable, and TRUE if a completion has been submitted and not yet consumed by the waiter. You can check this by printing the return value of completion_done() just after calling complete(): the value should be 1 (I tried this myself). To be sure, print the return value of completion_done() just after the call to wait_for_completion() returns: the value should be 0. That is, just after complete() there is a submitted completion that has not yet been consumed by the waiter, i.e. the caller of wait_for_completion(), and once wait_for_completion() returns there is no unconsumed completion left. Hence the return values of completion_done() are justified. The point, however, is that a return value of FALSE (0) does not imply that someone is waiting; the value is also 0 right after someone has consumed the submitted completion.
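The behaviour described above can be reproduced in a small userspace model. This is only a sketch, not kernel code: it assumes that the state behind a completion boils down to a counter of posted but not yet consumed completions (the done field of the kernel's struct completion), and the model_* names are hypothetical helpers made up for this illustration. model_try_wait() stands in for the non-blocking part of wait_for_completion().

```c
#include <stdbool.h>

/* Userspace model of the counter behind a completion; NOT kernel code. */
struct model_completion {
    unsigned int done;   /* completions posted but not yet consumed */
};

/* Models init_completion(): start with nothing posted. */
static void model_init(struct model_completion *c)
{
    c->done = 0;
}

/* Models complete(): post one completion. */
static void model_complete(struct model_completion *c)
{
    c->done++;
}

/* Models completion_done(): true only if a posted completion is pending. */
static bool model_completion_done(const struct model_completion *c)
{
    return c->done != 0;
}

/*
 * Models the non-blocking part of wait_for_completion(): consume one
 * posted completion if there is one. Returns false when the caller
 * would have to sleep instead.
 */
static bool model_try_wait(struct model_completion *c)
{
    if (c->done == 0)
        return false;
    c->done--;
    return true;
}
```

In this model, model_completion_done() is true right after model_complete() and false again once model_try_wait() has consumed the completion, matching the 1 and 0 seen in the experiment. It is also false on a freshly initialised variable, where no waiter exists at all.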

The kernel documentation says that if completion_done() returns FALSE, no completion has been submitted and hence there are waiters on the completion variable.

Here is the exact wording from the kernel documentation:

Finally, to check the state of a completion without changing it in any way, 
call completion_done(), which returns false if there are no posted
completions that were not yet consumed by waiters (implying that there are
waiters) and true otherwise;

	bool completion_done(struct completion *done)

That is a bit vague. Consider the situation when the MlmeScan_indication netlink callback comes in: no one is waiting in wait_for_completion(), but completion_done() returns FALSE because no completion has been submitted since the last one was consumed. If we rely only on the return value of completion_done(), see FALSE, and assume there is a waiter, we will go ahead and call complete() on the completion variable unnecessarily, i.e. without anyone waiting on it. As a result, the next time the work queue calls wait_for_completion() on that completion variable, it consumes the stale completion straight away and does not wait at all, which is a bug.

To solve the problem I introduced another variable, which is set and unset by the caller of wait_for_completion() and is checked along with completion_done() before calling complete().
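The difference between the two checks can be sketched with a single-threaded userspace model. This is not the driver's actual code: the struct mlme_sync type, the waiter_expected flag and all function names are hypothetical, and the done counter is assumed to mirror the counter inside the kernel's struct completion. The model replays one MlmeAddScan_request and its three responses, then reports how many stale completions are left behind.

```c
#include <stdbool.h>

/* Userspace model only; 'done' mirrors the completion's internal counter. */
struct mlme_sync {
    unsigned int done;       /* completions posted but not yet consumed */
    bool waiter_expected;    /* hypothetical flag owned by the waiter */
};

/* Waiter, step 1: announce the wait before blocking. */
static void waiter_begin(struct mlme_sync *s)
{
    s->waiter_expected = true;
}

/*
 * Waiter, step 2: models wait_for_completion() returning.
 * Returns false if the waiter would still be blocked.
 */
static bool waiter_finish(struct mlme_sync *s)
{
    if (s->done == 0)
        return false;
    s->done--;
    s->waiter_expected = false;
    return true;
}

/* Callback that relies on completion_done() alone. */
static void callback_naive(struct mlme_sync *s)
{
    if (s->done == 0)        /* completion_done() is false */
        s->done++;           /* complete() */
}

/* Callback that also checks the waiter's flag. */
static void callback_guarded(struct mlme_sync *s)
{
    if (s->waiter_expected && s->done == 0)
        s->done++;
}

/* Replays one scan request; returns the stale completions left over. */
static unsigned int replay_scan(bool guarded)
{
    struct mlme_sync s = { 0, false };
    void (*cb)(struct mlme_sync *) =
        guarded ? callback_guarded : callback_naive;

    waiter_begin(&s);
    cb(&s);                  /* MlmeAddScan_confirm */
    waiter_finish(&s);       /* work queue wakes and consumes it */
    cb(&s);                  /* MlmeScan_indication: nobody waiting */
    cb(&s);                  /* MlmeScan_done: nobody waiting */
    return s.done;
}
```

Here replay_scan(false) leaves one stale completion behind, which is what lets the next wait return immediately, while replay_scan(true) leaves none. In real driver code the flag and the check would also need protection against races between the waiter and the callback, for example with a lock; the model ignores this.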
