View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0002098||Ham Radio Deluxe||Bug||public||2017-07-05 14:04||2018-07-25 20:47|
|Platform||Intel i7-5960X||OS||Windows 10 Professional x64||OS Version||1703|
|Target Version||Fixed in Version||184.108.40.2063|
|Summary||0002098: Logbook UI stalls for a while after connecting DX Cluster|
|Description||After connecting DX Cluster, Logbook becomes unresponsive because it creates an inappropriately large number of threads.|
In Task Manager, you can see the number of threads climbing like crazy; for me, it hit 514 with the repro steps below. Private memory and commit size climbs substantially, too. Commit size climbs because each thread gets a committed stack space of 1 megabyte -- 500 megs, just for stack!
The logbook is completely unresponsive for more than a minute on my dev machine (it's a beast!). The user is offered no feedback. The application sluggishly populates the database and then eventually stabilizes. On my machine, that takes around 5 minutes. After the first update, the user may notice an "updated"
It looks like the culprit is the CDXObject::LookupThreadProc method, which uses a couple conditionals to decide that it should lock a critical section before looking up anything, then releases the critical section. This ends up blocking all of the created threads, completely serializing the work.
The 500 threads are created with the first 500 DX Cluster reports received. As further reports come in (even after the application settles down) a thread is created for each request taht comes in. Using a thread per request is an anti-pattern; it doesn't scale well, is (at least, in this case) too small of a granularity of work, and doesn't result in any better performance.
That queue can be serviced by multiple threads. If the options that allowed concurrent work to be performed were switched on, the application would end up ahmmering the database with expensive requests, without reguard to any I/O performance limits or connection count limits.
Some investigation needs to go into figuruing out why the UI thread in the app blocks and prevents the window from updating. Perhaps the system (even my beast of a dev machine!) is clogged by the context switching around so many threads, or there might be something blocking on the main thread. Conceivably, the heap is so busy with all these threads that it can't help the main UI thread.
(The heap is artificially busy, itself: code in OnCallsignLookup inside DXClusterDlg calls HeapValidate, which completely locks the heap for the duration of the run.)
Instead of all this, a queue should be implemented. Requests can be serviced by a few threads (four or eight or so ...) and allowed to queue up until caught up. The thread should manage a data structure rather than adjusting the UI directly. The UI can monitor or poll the data structure and render what it finds there. This assures the UI is able to respond, and isn't hindered by servicing requests from the threaded work items it was meant to offload.
|Steps To Reproduce||1) Fire up logbook|
2) Load a large Logbook database. The one I have has about 60,000 entries.
3) Use the "Options" button on the DX Cluster frame
4) In the resulting "Cluster Options" box, activate the "Connection" tab
5) Set "500" in the "After connecting load spots" drop box.
6) Start task manager
7) Use the Details tab, find HRDLogbook.EXE
8) Make sure you have the "Commit Size", "Threads", and "Memory (Private)" columns
9) In logbook, connect to your favorite DX Cluster
Logbook has become unresponsive.
|Tags||No tags attached.|
|related to||0002580||closed||K7ZCZ||Ham Radio Deluxe||Build 794 Logbook Crashes after clicking on cluster spots|
|related to||0001913||closed||WA9PIE||Ham Radio Deluxe||WSI are inaccurate in the 55x builds or are causing Logbook to freeze/crash|
|related to||0002732||closed||PD9FER||4 - Closed w/o Action||Logbook crashing when idle|
|related to||0002752||new||3 - Current Dev List||Logbook: crash reported when adding second MySQL database to configuration|
The thread(s) created by the DX Dialog end up calling a method on an object owned by the application. That class has "thread" in its name, but the mechanism doesn't enqueue the request to a thread.
The application-owned lookup method shares database connections in the object, so multiple calls from multiple threads end up accessing the same CDatabase instance. They end up failing, since they end up sharing the same HSTMT or HCONN objects -- and that's why the code doesn't work with multiple threads. It never had a chance.
I think we can come up with a much better architecture. Just a few threads to handle the request in a pool, each with its own set of connctions; that parallelizes work, is stable, shares minimal data, and is easy to synchronize.
I've been working on this issue as I've had time. It's quite involved, and other issues have frequently pushed forward in priority. I wanted to share some notes about progres.
Looks like a workaround to avoid having multiple threads working the same database connection was put in place. This hack tries to synchronize the threads so that only one thread is actually performing a lookup. Since only one thread is active, there's no benefit in having multiple threads. The application creates them by the hundreds, one for each request received. This is a terrible approach, since each thread takes a bunch of memory and has a considerable start-up cost.
To fix this correctly, we need to make a few threads (say, five) and let them run concurrently. To run concurrently, each thread needs its own database connection. The existing code can be re-used, but requires significant modification because it uses member data for a cache. That cache would have to be made thread-safe and shared across the threads. (That's actually another problem in the existing design; the shared data wasn't protected, so there's no chance that threads would work as-is.)
Further complicating matters is the fact that the code involved is not commented meaningfully; and the feature of the application is scarcely documented. (See Mantis 2680.) The threads are in place to speed-up the perceived lookup time of new DX Cluster spots. The lookups are slow because the database code involed is terrible; for each spot, six different table scans are done; the data in question is probably in memory anyway because the logbook loads each open database completely into memory, anyway.
Rewriting the database code would be the best solution, but because it's not documented or commented, it's intended operation is not well understood. A rewrite would have a strong chance of being different, but without a specification there's no way to know what behavior is correct.
Maybe it's a mistake, but I'll continue with splitting up the code so that it can work in multiple threads as-is. We can approach a more careful rewrite if a specification is ever developed. I'm making this choice for two reasons. First, the team has been very poor ate coming up with viable specifications for features. Second, the threading issues in the logbook are substantially contributing to its poor stability. A fix for this issue enables progress on those other problems.
This shelvset implements a complete functional fix for this issue.
The DX Cluster dialog receives spots from a DX Cluster server. Each spot is processed to see if it has been previously worked. The spot is examined on a few dimensions -- same mode, callsign, band, and some combinations of those are checked to find a previous contact. Database queries are executed for each spot despite the fact that the necesary data is probably already in memory (in the displayed logbook itself!); multiple databases might be queried if they are configured.
The core problem with this code is its inefficiency. The database queries are poorly written and the database design is marginal, so the queries are between unnecessary and needlessly slow. In reaction to this issue, some users migrate to a different database platform, blaming the Access issue for the poor performance.
The development team reacted to the problem by throwing threads at the problem. This work was done without diligent design or care in implementation. This checkin addresses that issue; unfortunately, it doesn't address the root cause of the poor database design or query implementation as those issues are even more involved than this fix.
The status quo is that one thread is created for each incoming spot observation. The use of an individual, new thread for each incoming request is a well-known systems design anti-pattern, and should never have been implemented. Predictably, the results of this change were very poor; in particular because the connection to a server can request as many as 500 of the most recent spots to pre-populate the window. The initial surge of spots spins up 500 threads, and those threads all concurrently begin blasting slow and poorly-written queries at the database.
A mediation was implemented that caused the threads to block around database access. The blocking solved the problem of overwhelming the database, but the application still created 500 threads -- all waiting for that single lock. Creating threads costs memory and time.
Even if this architecture was working, details of its implementation were unaccaptably poor. The implementation of the lookup code was not thread safe. Most notably, it would reuse the same connection across threads, which doesn't work. Some ODBC drivers (MySQL) crash when concurrently accessed from multiple threads; others detect the problem and return an error (Access, SQL Server).
During the spin-up of all the threads, CPU usage was pegged. Further, each successful lookup resulted in the sending of a message that would repaint the window on the main thread, while blocking the sending thread. This resulted in a near deadlock condition; the main thread was busy painting, the working thread was blocking on the redraw. The intention of putting the lookup work into a separate thread was almost completely nullified by its own implementation.
This change addresses these issues with a significant refactoring of the lookup code:
The change is quite successful in my testing, and results in these observable improvements:
The stability issues surrounding the use of threads are severely mitigated improving the stability of the application. I think this will help address the bugs related to this issue, where clicking on a spot would cause a crash based on race conditions towards the spot data structures.
While significant work was invested in refactoring code, there are still several problems with the code, some of them egregious. With this most significant work completed, further passes over the code might work to address those issues. Most notably, the factoring of the CBackgroundThreadLookup class could be improved.
Fixed with this checkin:
Outwardly, nothing much should be noticed by customers for this change. It would be prudent to check that spots are evaluated and marked correctly. (If any issues are detected, a new Mantis issue is probably best.)
It's easier to notice the threading problems if the connecting parameters are set to a large number. Use the "Options" button in the DX Cluster bar, then activate the "Connection" tab. Under "after connecting, load Spots" to a high number, like 500. The larger number causes a higher impulse load on the Logbook when first connecting to the DX Cluster.
Users should note that the UI is responsive at all times with minimal stalling. And that the UI always updates its status visibly -- that is, the "## spots" and "## updating" counters are moving as the lists are populated.
Testing with multiple open logbook databases (with multiple databases configured to be active for lookups) is helpful.
||Unable to see these columns in Task Manager "Commit Size", "Threads", and "Memory (Private)" columns but the logbook was still responsive with 500 spots selected.|
||Thus far, this "feels" better to me.|
I'm not sure if this is related or if it's just a higher volume of spots due to the KH1 DXpedition... but I'm seeing sluggish behavior in the DX cluster pane today.
Sluggish.PNG (54,304 bytes)
Sluggish.PNG (54,304 bytes)
||The Cluster is very quick for me showing 206 spots.|
With the 864 build, the cluster seems noticeably faster. I was having trouble before. I'm using Access these days... but recently, after having some problems with DX Cluster performance, I changed the MaxBufferSize for Jet 2.x from 514 to 4000.
Here's some info on it:
I'm not sure if that has anything to do with it. But I've seen this make a considerable difference for others using Access for Logbook.
I don't think the advice in the linked article is appropriate for our customers; it's from 1996 (more than 20 years ago) and refers to Jet 3.5.
I think that we should address the performance problems in our application itself; poor memory usage, algorithms with exponential time complexity, inefficient memory management, poor threading models, redundant and inefficient queries, and poor data representation solutions for starters. Addressing these issues will be more impactful and provide greater benefit in improving the product as a whole.
Mike's post in 5578 was also seen by Ron, W3RJW here
but no one else joined in. Must be OK now as he has not commented further.
I went through this scenario and I believe the problem that we were trying to fix is resolved. That is - a delay in loading initial spots.
Clearly, the method of loading them has been greatly improved. I like the way it's being done now.
|2017-07-05 14:04||K7ZCZ||New Issue|
|2017-09-14 23:00||K7ZCZ||Assigned To||=> K7ZCZ|
|2017-09-14 23:00||K7ZCZ||Status||new => assigned|
|2017-09-18 00:14||WA9PIE||Project||3 - Current Dev List => 2 - Next Dev List (Holding Area)|
|2018-04-03 19:47||K7ZCZ||Relationship added||related to 0002580|
|2018-04-05 22:10||K7ZCZ||Note Added: 0004709|
|2018-04-20 22:31||K7ZCZ||Relationship added||related to 0001913|
|2018-05-21 18:32||K7ZCZ||Relationship added||related to 0002732|
|2018-05-30 08:53||K7ZCZ||Relationship added||related to 0002752|
|2018-06-03 18:21||K7ZCZ||Note Added: 0005190|
|2018-06-09 10:41||K7ZCZ||Note Added: 0005228|
|2018-06-09 10:43||K7ZCZ||Note Added: 0005229|
|2018-06-21 14:27||K7ZCZ||Note Added: 0005368|
|2018-06-21 14:32||K7ZCZ||Note Added: 0005369|
|2018-06-21 14:32||K7ZCZ||Status||assigned => resolved|
|2018-06-21 14:32||K7ZCZ||Resolution||open => fixed|
|2018-06-21 14:32||K7ZCZ||Sub-Module||(select) => DX Cluster|
|2018-06-21 14:32||K7ZCZ||Testing||=> Not Started|
|2018-06-21 14:32||K7ZCZ||Project||2 - Next Dev List (Holding Area) => 3 - Current Dev List|
|2018-06-28 10:07||K7ZCZ||Fixed in Version||=> 220.127.116.118|
|2018-06-28 11:53||g3ucq||Note Added: 0005520|
|2018-06-30 00:46||WA9PIE||Note Added: 0005569|
|2018-06-30 21:12||WA9PIE||File Added: Sluggish.PNG|
|2018-06-30 21:12||WA9PIE||Note Added: 0005578|
|2018-07-13 15:56||g3ucq||Note Added: 0005686|
|2018-07-14 16:02||WA9PIE||Note Added: 0005704|
|2018-07-14 22:26||K7ZCZ||Note Added: 0005707|
|2018-07-15 03:51||g3ucq||Note Added: 0005709|
|2018-07-17 16:55||WA9PIE||Status||resolved => closed|
|2018-07-17 16:55||WA9PIE||Description Updated||View Revisions|
|2018-07-17 16:55||WA9PIE||Testing||Not Started => Beta Successful|
|2018-07-17 16:55||WA9PIE||Note Added: 0005726|
|2018-07-23 22:52||WA9PIE||Project||3 - Current Dev List => Ham Radio Deluxe|
|2018-07-24 15:35||WA9PIE||Project||Ham Radio Deluxe => 3 - Current Dev List|
|2018-07-25 08:00||WA9PIE||Fixed in Version||18.104.22.1688 => 22.214.171.1243|
|2018-07-25 20:47||WA9PIE||Project||3 - Current Dev List => Ham Radio Deluxe|