View Issue Details

ID: 0002098
Project: Ham Radio Deluxe
Category: Bug
View Status: public
Last Update: 2018-07-25 20:47
Reporter: K7ZCZ
Assigned To: K7ZCZ
Priority: normal
Severity: major
Reproducibility: always
Status: closed
Resolution: fixed
Platform: Intel i7-5960X
OS: Windows 10 Professional x64
OS Version: 1703
Product Version: 6.4.0.658
Target Version:
Fixed in Version: 6.4.0.873
Summary: 0002098: Logbook UI stalls for a while after connecting DX Cluster
Description: After connecting to a DX Cluster, Logbook becomes unresponsive because it creates an inappropriately large number of threads.

In Task Manager, you can see the number of threads climbing like crazy; for me, it hit 514 with the repro steps below. Private memory and commit size climb substantially, too. Commit size climbs because each thread gets a committed stack space of 1 megabyte -- 500 megs, just for stack!

The logbook is completely unresponsive for more than a minute on my dev machine (it's a beast!). The user is offered no feedback. The application sluggishly populates the database and then eventually stabilizes. On my machine, that takes around 5 minutes. After the first update, the user may notice an "updated"

It looks like the culprit is the CDXObject::LookupThreadProc method, which uses a couple of conditionals to decide that it should lock a critical section before looking up anything, then releases the critical section afterwards. This ends up blocking all of the created threads, completely serializing the work.
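
For illustration, that serialization pattern amounts to something like the following minimal sketch; the names and body are hypothetical, not copied from the HRD source.

    // Hypothetical illustration of the pattern described above, not the HRD code:
    // every lookup thread takes the same process-wide lock around its entire
    // unit of work.
    #include <afxmt.h>   // CCriticalSection, CSingleLock

    static CCriticalSection g_lookupLock;      // one process-wide lock

    UINT LookupThreadProc(LPVOID /*pParam*/)
    {
        CSingleLock hold(&g_lookupLock, TRUE); // every thread parks here
        // ... the entire database lookup runs while the lock is held, so
        // 500 threads get no more done than one thread would ...
        return 0;                              // lock released on destruction
    }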

The 500 threads are created with the first 500 DX Cluster reports received. As further reports come in (even after the application settles down), a thread is created for each request. Using a thread per request is an anti-pattern: it doesn't scale well, the granularity of work is (at least in this case) too small, and it doesn't yield any better performance.

Such a queue could be serviced by multiple threads, but if the options that allow concurrent work were simply switched on, the application would end up hammering the database with expensive requests, without regard to any I/O performance limits or connection count limits.

Some investigation needs to go into figuring out why the UI thread in the app blocks and prevents the window from updating. Perhaps the system (even my beast of a dev machine!) is clogged by the context switching across so many threads, or there might be something blocking on the main thread. Conceivably, the heap is so busy with all these threads that it can't serve the main UI thread.

(The heap is artificially busy, itself: code in OnCallsignLookup inside DXClusterDlg calls HeapValidate, which completely locks the heap for the duration of the run.)

Instead of all this, a queue should be implemented. Requests can be serviced by a few threads (four or eight or so) and allowed to queue up until the backlog is worked off. The worker threads should manage a data structure rather than adjusting the UI directly; the UI can monitor or poll that data structure and render what it finds there. This ensures the UI stays responsive and isn't hindered by servicing the very work items it was meant to offload.
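
As an illustration of that shape, here is a minimal, standalone sketch in portable C++. Every name in it is made up for the example (SpotRequest, LookupPool, and so on); none of it is taken from the HRD source, and a real worker would run its database lookup where the placeholder comment sits.

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>
    #include <vector>

    // A spot lookup request and its result; the fields are placeholders.
    struct SpotRequest { std::string callsign; };
    struct SpotResult  { std::string callsign; bool workedBefore; };

    class LookupPool {
    public:
        explicit LookupPool(size_t workers = 4) {
            for (size_t i = 0; i < workers; ++i)
                m_threads.emplace_back([this] { Worker(); });
        }
        ~LookupPool() {
            { std::lock_guard<std::mutex> g(m_lock); m_stop = true; }
            m_wake.notify_all();
            for (auto& t : m_threads) t.join();
        }
        // Called for each incoming spot; returns immediately.
        void Enqueue(SpotRequest req) {
            { std::lock_guard<std::mutex> g(m_lock); m_requests.push(std::move(req)); }
            m_wake.notify_one();
        }
        // Called from the UI thread (e.g. on a timer) to drain finished results.
        std::vector<SpotResult> PollResults() {
            std::lock_guard<std::mutex> g(m_lock);
            std::vector<SpotResult> out;
            out.swap(m_results);
            return out;
        }
    private:
        void Worker() {
            for (;;) {
                SpotRequest req;
                {
                    std::unique_lock<std::mutex> g(m_lock);
                    m_wake.wait(g, [this] { return m_stop || !m_requests.empty(); });
                    if (m_stop && m_requests.empty())
                        return;
                    req = std::move(m_requests.front());
                    m_requests.pop();
                }
                // A real worker would run the "worked before?" lookup here,
                // using its own private database connection; this placeholder
                // just fabricates a result.
                SpotResult res{req.callsign, false};
                std::lock_guard<std::mutex> g(m_lock);
                m_results.push_back(std::move(res));
            }
        }

        std::mutex m_lock;
        std::condition_variable m_wake;
        std::queue<SpotRequest> m_requests;
        std::vector<SpotResult> m_results;
        std::vector<std::thread> m_threads;
        bool m_stop = false;
    };

The UI thread would call PollResults() from a timer and render whatever it finds, so no worker ever touches a window directly.
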
Steps To Reproduce:
1) Fire up Logbook
2) Load a large Logbook database. The one I have has about 60,000 entries.
3) Use the "Options" button on the DX Cluster frame
4) In the resulting "Cluster Options" box, activate the "Connection" tab
5) Set "500" in the "After connecting load spots" drop box.
6) Start task manager
7) Use the Details tab, find HRDLogbook.EXE
8) Make sure the "Commit Size", "Threads", and "Memory (Private)" columns are visible
9) In logbook, connect to your favorite DX Cluster

Logbook has become unresponsive.
Tags: No tags attached.
Module: Logbook
Sub-Module: DX Cluster
Testing: Beta Successful

Relationships

related to 0002580 closed K7ZCZ Ham Radio Deluxe Build 794 Logbook Crashes after clicking on cluster spots
related to 0001913 closed WA9PIE Ham Radio Deluxe WSI are inaccurate in the 55x builds or are causing Logbook to freeze/crash
related to 0002732 closedPD9FER 4 - Closed w/o Action Logbook crashing when idle 
related to 0002752 new 3 - Current Dev List Logbook: crash reported when adding second MySQL database to configuration 

Activities

K7ZCZ

2018-04-05 22:10

manager   ~0004709

The threads created by the DX Cluster dialog end up calling a method on an object owned by the application. That class has "thread" in its name, but the mechanism doesn't actually enqueue the request to a worker thread.

The application-owned lookup method shares the database connections held in that object, so calls from multiple threads access the same CDatabase instance. Those calls fail because they share the same HSTMT and HCONN objects -- and that's why the code doesn't work with multiple threads. It never had a chance.

I think we can come up with a much better architecture: just a few threads handling the requests as a pool, each with its own set of connections. That parallelizes the work, is stable, shares minimal data, and is easy to synchronize.
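
A rough sketch of what a per-thread connection could look like with MFC's CDatabase; the thread procedure, connect string, and queue parameter are all hypothetical, not taken from the HRD source.

    // Sketch only: each worker thread opens its own MFC CDatabase, so no
    // HSTMT or HCONN is ever shared between threads.
    #include <afxdb.h>   // CDatabase, CDBException

    UINT LookupWorkerProc(LPVOID /*pSharedQueue*/)
    {
        CDatabase db;   // private to this thread
        try
        {
            // The connect string is hypothetical; the real one would come
            // from the logbook's configured database.
            db.OpenEx(_T("DSN=HRDLogbook;"), CDatabase::noOdbcDialog);

            // ... loop: take the next lookup request from the shared queue
            // and run its queries against this thread's own connection ...

            db.Close();
        }
        catch (CDBException* e)
        {
            e->Delete();   // report/log as appropriate, then free the exception
        }
        return 0;
    }

    // Started a handful of times, for example:
    //   for (int i = 0; i < 5; ++i)
    //       AfxBeginThread(LookupWorkerProc, pSharedQueue);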

K7ZCZ

2018-06-03 18:21

manager   ~0005190

I've been working on this issue as I've had time. It's quite involved, and other issues have frequently pushed ahead of it in priority. I wanted to share some notes about progress.

It looks like a workaround was put in place to avoid having multiple threads work the same database connection. This hack synchronizes the threads so that only one thread is actually performing a lookup at a time. Since only one thread is ever active, there's no benefit in having multiple threads -- yet the application creates them by the hundreds, one for each request received. This is a terrible approach, since each thread takes a chunk of memory and has a considerable start-up cost.

To fix this correctly, we need to create a few threads (say, five) and let them run concurrently. To run concurrently, each thread needs its own database connection. The existing code can be reused, but requires significant modification because it uses member data as a cache; that cache would have to be made thread-safe and shared across the threads. (That's actually another problem in the existing design: the shared data wasn't protected, so there was no chance the threads would work as-is.)
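
A minimal sketch of the kind of thread-safe shared cache described here, assuming a simple callsign-keyed map guarded by one lock; the names are illustrative only.

    #include <mutex>
    #include <string>
    #include <unordered_map>

    // One instance shared by all lookup workers; the lock keeps concurrent
    // readers and writers from corrupting the map.
    class LookupCache {
    public:
        // Returns true and fills 'workedBefore' if the callsign is cached.
        bool TryGet(const std::string& callsign, bool& workedBefore) const {
            std::lock_guard<std::mutex> g(m_lock);
            auto it = m_entries.find(callsign);
            if (it == m_entries.end())
                return false;
            workedBefore = it->second;
            return true;
        }

        // Records a lookup result so other workers can skip the database.
        void Put(const std::string& callsign, bool workedBefore) {
            std::lock_guard<std::mutex> g(m_lock);
            m_entries[callsign] = workedBefore;
        }

    private:
        mutable std::mutex m_lock;
        std::unordered_map<std::string, bool> m_entries;
    };

Workers would call TryGet() before issuing any query and Put() after a database lookup completes.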

Further complicating matters, the code involved is not commented meaningfully, and this feature of the application is scarcely documented. (See Mantis 2680.) The threads are in place to speed up the perceived lookup time of new DX Cluster spots. The lookups are slow because the database code involved is terrible: for each spot, six different table scans are done, even though the data in question is probably already in memory, since the logbook loads each open database completely into memory anyway.

Rewriting the database code would be the best solution, but because it's not documented or commented, its intended operation is not well understood. A rewrite would have a strong chance of behaving differently, and without a specification there's no way to know what behavior is correct.

Maybe it's a mistake, but I'll continue with splitting up the code so that it can work in multiple threads as-is. We can approach a more careful rewrite if a specification is ever developed. I'm making this choice for two reasons. First, the team has been very poor at coming up with viable specifications for features. Second, the threading issues in the logbook are substantially contributing to its poor stability, and a fix for this issue enables progress on those other problems.

K7ZCZ

2018-06-09 10:41

manager   ~0005228

This shelveset implements a complete functional fix for this issue.
https://hrdsoftware.visualstudio.com/HRD/_versionControl/shelveset?ss=DX%20Cluster%20Threading%20%28Mantis%202098%29%3Bmikeblas%40msn.com

K7ZCZ

2018-06-09 10:43

manager   ~0005229

The DX Cluster dialog receives spots from a DX Cluster server. Each spot is processed to see if it has been previously worked: the spot is examined on a few dimensions -- same callsign, band, mode, and some combinations of those -- to find a previous contact. Database queries are executed for each spot despite the fact that the necessary data is probably already in memory (in the displayed logbook itself!); multiple databases might be queried if more than one is configured.

The core problem with this code is its inefficiency. The database queries are poorly written and the database design is marginal, so the queries range from unnecessary to needlessly slow. In reaction to this issue, some users migrate to a different database platform, blaming Access for the poor performance.

The development team reacted by throwing threads at the problem. This work was done without diligent design or care in implementation. This check-in addresses that issue; unfortunately, it doesn't address the root causes -- the poor database design and query implementation -- as those issues are even more involved than this fix.

The status quo is that one thread is created for each incoming spot observation. The use of an individual, new thread for each incoming request is a well-known systems design anti-pattern, and should never have been implemented. Predictably, the results of this change were very poor; in particular because the connection to a server can request as many as 500 of the most recent spots to pre-populate the window. The initial surge of spots spins up 500 threads, and those threads all concurrently begin blasting slow and poorly-written queries at the database.

A remediation was implemented that caused the threads to block around database access. The blocking solved the problem of overwhelming the database, but the application still created 500 threads -- all waiting for that single lock. Creating threads costs memory and time.

Even if this architecture had worked, details of its implementation were unacceptably poor. The implementation of the lookup code was not thread-safe. Most notably, it would reuse the same connection across threads, which doesn't work: some ODBC drivers (MySQL) crash when concurrently accessed from multiple threads; others detect the problem and return an error (Access, SQL Server).

During the spin-up of all the threads, CPU usage was pegged. Further, each successful lookup sent a message that repainted the window on the main thread while blocking the sending thread. This produced a near-deadlock condition: the main thread was busy painting, and the worker thread was blocked on the redraw. The intent of putting the lookup work into a separate thread was almost completely nullified by its own implementation.
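
As a sketch of the non-blocking alternative (the message ID and function name are hypothetical, not HRD's), a worker can post its notification rather than send it, so it never waits on the UI thread:

    #include <afxwin.h>   // brings in the Win32 message APIs used below

    static const UINT WM_APP_SPOT_LOOKED_UP = WM_APP + 1;

    // Worker-thread side: PostMessage queues the notification and returns at
    // once, so the worker never blocks on the UI thread. (SendMessage, by
    // contrast, blocks the caller until the UI thread has processed the
    // message.)
    void NotifySpotLookedUp(HWND hwndClusterDlg, int spotIndex)
    {
        ::PostMessage(hwndClusterDlg, WM_APP_SPOT_LOOKED_UP,
                      static_cast<WPARAM>(spotIndex), 0);
    }

    // UI-thread side: an ON_MESSAGE handler in the dialog marks the spot as
    // looked up and invalidates its row; the actual repaint is then coalesced
    // by normal WM_PAINT processing instead of being forced for every spot.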

This change addresses these issues with a significant refactoring of the lookup code:


  • A single lookup object was available at the application level. This class has been refactored so that it's easy (and possible) to create multiple instances of the object. The object owns its own connections to the database, so each instance of the management object can use its own private connections.

  • The remediation code is removed, and multiple concurrent requests to the database are now possible.

  • To avoid the C10K problem, exactly 5 threads are created. (This is an arbitrary choice; maybe it can become configurable if problems are evident.) The threads poll an internal queue populated with lookup requests generated by received spots, and they work through those requests smoothly without consuming egregious CPU or overwhelming the database server -- regardless of its type.

  • A cache that was local to the single instance of the management object has been refactored into its own standalone object. With a single cache object, every running thread references the same instance and shares as much data as possible, reducing the number of lookups performed against the database. The cache has also been made thread-safe, a property not present in the status quo implementation.

  • Finally, issues surrounding spurious repainting of the app's UI in response to lookup progress are addressed.



The change is quite successful in my testing, and results in these observable improvements:


  • Connecting to a DX Cluster, even with a large prefetch setting, no longer stalls the application for long. In my tests, the time the application's UI stays non-responsive has dropped by a factor of roughly 22.

  • The UI is responsive while DX Cluster lookups are happening.

  • The UI shows and smoothly updates "spots" and "pending" counters that indicate progress.

  • The "in progress" state of a displayed spot (is it looked up yet, or not?) is clearly visible and accurately updated.



The stability issues surrounding the use of threads are severely mitigated, improving the stability of the application. I think this will help address the bugs related to this issue, where clicking on a spot could cause a crash due to race conditions on the spot data structures.

While significant work was invested in refactoring the code, several problems remain, some of them egregious. With this most significant work completed, further passes over the code can address those issues. Most notably, the factoring of the CBackgroundThreadLookup class could be improved.

K7ZCZ

2018-06-21 14:27

manager   ~0005368

Fixed with this checkin:
https://hrdsoftware.visualstudio.com/HRD/_versionControl/changeset/4174

K7ZCZ

2018-06-21 14:32

manager   ~0005369

Outwardly, customers shouldn't notice much from this change. It would be prudent to check that spots are evaluated and marked correctly. (If any issues are detected, a new Mantis issue is probably best.)

It's easier to notice the threading problems if the connection parameters are set to a large number. Use the "Options" button in the DX Cluster bar, then activate the "Connection" tab and set "after connecting, load Spots" to a high number, like 500. The larger number causes a higher impulse load on the Logbook when first connecting to the DX Cluster.

Users should see that the UI is responsive at all times with minimal stalling, and that the UI always updates its status visibly -- that is, the "## spots" and "## updating" counters keep moving as the lists are populated.

Testing with multiple open logbook databases (with multiple databases configured to be active for lookups) is helpful.

g3ucq

2018-06-28 11:53

viewer   ~0005520

I was unable to see the "Commit Size", "Threads", and "Memory (Private)" columns in Task Manager, but the Logbook was still responsive with 500 spots selected.

WA9PIE

2018-06-30 00:46

administrator   ~0005569

Thus far, this "feels" better to me.

WA9PIE

2018-06-30 21:12

administrator   ~0005578

I'm not sure if this is related or if it's just a higher volume of spots due to the KH1 DXpedition... but I'm seeing sluggish behavior in the DX cluster pane today.

Sluggish.PNG (54,304 bytes)

g3ucq

2018-07-13 15:56

viewer   ~0005686

The Cluster is very quick for me, showing 206 spots.

WA9PIE

2018-07-14 16:02

administrator   ~0005704

With the 864 build, the cluster seems noticeably faster. I was having trouble before. I'm using Access these days... but recently, after having some problems with DX Cluster performance, I changed the MaxBufferSize for Jet 2.x from 514 to 4000.

Here's some info on it:

http://www.casadebender.com/reference/other/bldapps/chapters/ba13_2.htm

I'm not sure if that has anything to do with it. But I've seen this make a considerable difference for others using Access for Logbook.

K7ZCZ

2018-07-14 22:26

manager   ~0005707

I don't think the advice in the linked article is appropriate for our customers; it's from 1996 (more than 20 years ago) and refers to Jet 3.5.

I think we should address the performance problems in our application itself: poor memory usage, algorithms with exponential time complexity, inefficient memory management, poor threading models, redundant and inefficient queries, and poor data representations, for starters. Addressing these issues will be more impactful and provide greater benefit in improving the product as a whole.

g3ucq

2018-07-15 03:51

viewer   ~0005709

Mike's post in 5578 was also seen by Ron, W3RJW here
https://forums.hamradiodeluxe.com/node/46893
but no one else joined in. Must be OK now as he has not commented further.

WA9PIE

2018-07-17 16:55

administrator   ~0005726

I went through this scenario and I believe the problem that we were trying to fix is resolved. That is - a delay in loading initial spots.

Clearly, the method of loading them has been greatly improved. I like the way it's being done now.

Issue History

Date Modified Username Field Change
2017-07-05 14:04 K7ZCZ New Issue
2017-09-14 23:00 K7ZCZ Assigned To => K7ZCZ
2017-09-14 23:00 K7ZCZ Status new => assigned
2017-09-18 00:14 WA9PIE Project 3 - Current Dev List => 2 - Next Dev List (Holding Area)
2018-04-03 19:47 K7ZCZ Relationship added related to 0002580
2018-04-05 22:10 K7ZCZ Note Added: 0004709
2018-04-20 22:31 K7ZCZ Relationship added related to 0001913
2018-05-21 18:32 K7ZCZ Relationship added related to 0002732
2018-05-30 08:53 K7ZCZ Relationship added related to 0002752
2018-06-03 18:21 K7ZCZ Note Added: 0005190
2018-06-09 10:41 K7ZCZ Note Added: 0005228
2018-06-09 10:43 K7ZCZ Note Added: 0005229
2018-06-21 14:27 K7ZCZ Note Added: 0005368
2018-06-21 14:32 K7ZCZ Note Added: 0005369
2018-06-21 14:32 K7ZCZ Status assigned => resolved
2018-06-21 14:32 K7ZCZ Resolution open => fixed
2018-06-21 14:32 K7ZCZ Sub-Module (select) => DX Cluster
2018-06-21 14:32 K7ZCZ Testing => Not Started
2018-06-21 14:32 K7ZCZ Project 2 - Next Dev List (Holding Area) => 3 - Current Dev List
2018-06-28 10:07 K7ZCZ Fixed in Version => 6.4.0.858
2018-06-28 11:53 g3ucq Note Added: 0005520
2018-06-30 00:46 WA9PIE Note Added: 0005569
2018-06-30 21:12 WA9PIE File Added: Sluggish.PNG
2018-06-30 21:12 WA9PIE Note Added: 0005578
2018-07-13 15:56 g3ucq Note Added: 0005686
2018-07-14 16:02 WA9PIE Note Added: 0005704
2018-07-14 22:26 K7ZCZ Note Added: 0005707
2018-07-15 03:51 g3ucq Note Added: 0005709
2018-07-17 16:55 WA9PIE Status resolved => closed
2018-07-17 16:55 WA9PIE Description Updated View Revisions
2018-07-17 16:55 WA9PIE Testing Not Started => Beta Successful
2018-07-17 16:55 WA9PIE Note Added: 0005726
2018-07-23 22:52 WA9PIE Project 3 - Current Dev List => Ham Radio Deluxe
2018-07-24 15:35 WA9PIE Project Ham Radio Deluxe => 3 - Current Dev List
2018-07-25 08:00 WA9PIE Fixed in Version 6.4.0.858 => 6.4.0.873
2018-07-25 20:47 WA9PIE Project 3 - Current Dev List => Ham Radio Deluxe