Bug 50820 - Provisioning App: error in listener stops processing of queue
Status: VERIFIED FIXED
Product: Z_Internal OX development
Classification: Unclassified
Component: OX-Connector
Version: unspecified
Hardware: Other Linux
Importance: P5 normal
Target Milestone: ---
Assigned To: Dirk Wiesenthal
Daniel Tröder
: interim-2
Depends on:
Blocks:
Reported: 2020-02-13 12:19 CET by Daniel Tröder
Modified: 2020-02-25 15:47 CET (History)
What kind of report is it?: Bug Report

Description Daniel Tröder univentionstaff 2020-02-13 12:19:01 CET
When there is an error in the handling of an item in the listener queue, a traceback is logged and all processing stops.

Five seconds later the same happens.

This then repeats indefinitely, until the disk holding the logfile is full and other processes crash.

1. The listener's error handling must be more robust. It should handle the remaining queue items.
2. The listener should have a (logarithmic?) back-off algorithm for retrying the same problematic queue item after 10s, 30s, 1min, 10min, 1h.
3. The problematic queue item's file name must be printed to the logfile, so an administrator has the means to manually remove it.
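The back-off requested in item 2 could look roughly like the following. This is only a sketch of the idea, not code from the OX-Connector: the names `BACKOFF_SCHEDULE`, `backoff_delays` and `retry_with_backoff` are hypothetical, and repeating the last delay (1h) forever is an assumption, since the request does not say what should happen after the fifth retry.

```python
import itertools
import time

# Delays from the request above: 10s, 30s, 1min, 10min, 1h.
BACKOFF_SCHEDULE = [10, 30, 60, 600, 3600]


def backoff_delays(schedule=BACKOFF_SCHEDULE):
    """Yield each delay once, then repeat the last one forever (assumption)."""
    yield from schedule
    yield from itertools.repeat(schedule[-1])


def retry_with_backoff(handle, path, max_attempts, sleep=time.sleep, log=print):
    """Call handle(path) until it succeeds or max_attempts is reached.

    Returns True on success, False if every attempt failed.
    """
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            handle(path)
            return True
        except Exception as exc:
            # Always name the file, so an administrator can find and remove it.
            log("Error while processing %s (attempt %d): %s" % (path, attempt, exc))
            if attempt < max_attempts:
                # Wait longer after each consecutive failure.
                sleep(next(delays))
    return False
```

Passing `sleep` and `log` as parameters keeps the sketch testable without real waiting.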
Comment 1 Dirk Wiesenthal univentionstaff 2020-02-25 10:55:54 CET
Fixed with the last line in the listener_trigger:
  run_on_files(objs, run, pause_after_errors_num=3, pause_after_errors_length=60)
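
For readers without access to the code, the behavior described in this comment can be approximated as below. This is a hypothetical reconstruction, not the actual `run_on_files` implementation: the function name `run_on_files_sketch`, the `state` dict used to carry the error count between invocations, and the assumption that the caller removes successfully processed files from the queue are all illustrative.

```python
import time


def run_on_files_sketch(paths, run, pause_after_errors_num=3,
                        pause_after_errors_length=60, state=None,
                        sleep=time.sleep, log=print):
    """Process paths in order and stop at the first failure.

    `state` carries the consecutive-error count between invocations
    (hypothetical; the real code may track this differently).
    Returns the list of paths that were processed successfully.
    """
    if state is None:
        state = {}
    done = []
    for path in paths:
        try:
            run(path)
        except Exception as exc:
            state["errors"] = state.get("errors", 0) + 1
            log("Error while processing %s: %s" % (path, exc))
            if state["errors"] >= pause_after_errors_num:
                # Recurring errors: pause so that retries do not flood the log.
                sleep(pause_after_errors_length)
            # Later items may depend on this one, so do not skip ahead.
            break
        else:
            state["errors"] = 0
            done.append(path)
    return done
```

With the defaults, the first two consecutive failures are retried without a pause; from the third one on, each failure is followed by a 60 second sleep.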


(In reply to Daniel Tröder from comment #0)
> 1. The listener's error handling must be more robust. It should handle the
> remaining queue items.

Not done, as we do not know whether the failed task is a prerequisite for subsequent tasks (for example, creating a context).

> 2. The listener should have a (logarithmic?) back-off algorithm for retrying
> the same problematic queue item after 10s, 30s, 1min, 10min, 1h.

Currently there is no logarithmic increase in the sleep time and it is not configurable.

> 3. The problematic queue item's file name must be printed to the logfile, so
> an administrator has the means to manually remove it.

Done. Example log line:
Error while processing /var/lib/univention-appcenter/apps/ox-connector/data/listener/2020-02-24-23-40-30-263937.json
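
An administrator could collect the affected queue files from the logfile with a few lines of Python. This is a sketch that assumes only that the message text "Error while processing <path>.json" appears somewhere in each error line; the rest of the log line format is not known from this report, and the function name `failing_queue_files` is hypothetical.

```python
import re

# Matches the message shown above; any prefix (timestamp, level) is ignored.
PATTERN = re.compile(r"Error while processing (\S+\.json)")


def failing_queue_files(log_lines):
    """Return the distinct queue file paths named in error lines, in first-seen order."""
    seen = []
    for line in log_lines:
        match = PATTERN.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen
```

The returned paths can then be inspected before deciding whether a file should be removed by hand.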
Comment 2 Daniel Tröder univentionstaff 2020-02-25 15:47:01 CET
(In reply to Dirk Wiesenthal from comment #1)
> Fixed with the last line in the listener_trigger:
>   run_on_files(objs, run, pause_after_errors_num=3, pause_after_errors_length=60)
> 
> 
> (In reply to Daniel Tröder from comment #0)
> > 1. The listener's error handling must be more robust. It should handle the
> > remaining queue items.
> 
> Not done as we do not know if the failed task was important for the
> subsequent tasks (like creating a context)
OK: we decided to make the behavior this way, as it is safer

> > 2. The listener should have a (logarithmic?) back-off algorithm for retrying
> > the same problematic queue item after 10s, 30s, 1min, 10min, 1h.
> 
> Currently there is no logarithmic increase in the sleep time and it is not
> configurable.
OK: sleeps 60s, when recurring errors are detected

> > 3. The problematic queue item's file name must be printed to the logfile, so
> > an administrator has the means to manually remove it.
> 
> Error while processing
> /var/lib/univention-appcenter/apps/ox-connector/data/listener/2020-02-24-23-40-30-263937.json
OK.