
SCP: Fix scp_asynch_check cross-thread interference #333

Open
wants to merge 3 commits into base: master

Conversation

bscottm
Contributor

@bscottm bscottm commented Dec 27, 2023

[Mark: I'm prepared for the tirade. There never was a "false positive" detected by the Clang/LLVM thread sanitizer.

  • The "lock-free" path wasn't actually lock-free, because AIO_ILOCK and AIO_IUNLOCK were defined as AIO_LOCK and AIO_UNLOCK, respectively. They are now empty preprocessor defines, with no net performance impact.
  • Using AIO_LOCK and AIO_UNLOCK in _sim_debug_write_flush to serialize debug output is a bit sketchy.
  • This refactoring uses an #ifdef/#endif forest and static inlines, rather than #define-ing compare-and-swap in terms of Microsoft's InterlockedCompareExchangePointer. This makes the code more maintainable and (arguably) more readable.
  • AIO_INIT is asymmetric between the lock-free and lock-based code paths, notably, sim_asynch_lock is recursive for lock-based, but not for lock-free. I didn't change the asymmetry, although AIO_INIT could be hoisted out of the lock-free/lock-based preprocessor conditionals and use a recursive mutex in both cases.]

[Note: I've tried this branch against the 3b2/400 SVR3 and PDP10-KL TOPS 20 v7 installs, both of which are I/O heavy and exercise multiple devices. Haven't experienced any lockups or issues.

Also: Does anyone actually compile SIMH for Intel Itanium on a DEC platform currently?]

Fix a Clang/LLVM sanitizer warning that scp_asynch_check is written by both the AIO and main threads, one thread potentially clobbering the other's value. The scp_asynch_check = 0 at scp.c:239 is where the thread sanitizer detects the issue.

scp_asynch_check = 0 is supposed to cause the main thread to process I/O updates on the next AIO_CHECK_EVENT code's execution. To preserve that behavior, AIO_CHECK_EVENT now executes AIO_UPDATE_QUEUE when either sim_asynch_check decrements below 0 or there is work to be done on the AIO queue (sim_asynch_queue != QUEUE_HEAD.)
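The revised check can be illustrated with a small self-contained sketch: process the AIO queue when the countdown goes negative OR the queue is non-empty. The names here (aio_check_event, aio_update_queue, sim_asynch_inst_latency) are simplified stand-ins for the SIMH macros, not the actual implementation.

```c
#include <stddef.h>

struct queue_node { struct queue_node *next; };

/* Stand-in for the QUEUE_HEAD sentinel used in the description above. */
#define QUEUE_HEAD ((struct queue_node *) 0)

static struct queue_node *sim_asynch_queue = QUEUE_HEAD;
static int sim_asynch_check;
static int sim_asynch_inst_latency = 100;
static int updates_processed;

/* Stand-in for AIO_UPDATE_QUEUE: drain pending work. */
static void aio_update_queue(void)
{
    sim_asynch_queue = QUEUE_HEAD;
    ++updates_processed;
}

/* Called once per simulated instruction, like AIO_CHECK_EVENT: update
   when the counter decrements below 0 OR the queue has pending work. */
static void aio_check_event(void)
{
    if (--sim_asynch_check < 0 || sim_asynch_queue != QUEUE_HEAD) {
        aio_update_queue();
        sim_asynch_check = sim_asynch_inst_latency;
    }
}
```

The point of the disjunction is that even if another thread's write to the check counter is lost, queued work is still noticed promptly via the second operand.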

Code refactoring:

  • AIO_ILOCK/AIO_IUNLOCK: As documented in sim_defs.h, these two macros are empty in the lock-free code case. Minor performance improvement.

  • AIO_QUEUE_VAL and AIO_QUEUE_SET are inlined static functions and directly use compiler intrinsics for the lock-free implementation, when available.

    • GCC/Clang: Use the __atomic intrinsics. __sync intrinsics have been deprecated since the C++11 standard's publication. The __sync intrinsics are still available and provide a fallback if GCC or Clang is really so old that the __atomic intrinsics are unavailable (insert Wallace Shawn's iconic "That's totally inconceivable!" here.)

    • AIO_QUEUE_VAL: Leverages the processor's read fence when available. Ensures that sim_asynch_queue's value is consistently read across cores and threads in the lock-free implementation for platforms without total store ordering or strong consistency guarantees (i.e., anything other than Intel and SPARC.)

    • AIO_QUEUE_SET: Compare-exchange to set sim_asynch_queue's head. The return value is a boolean (1/0) to reflect the success (1) or failure (0) of the compare-exchange.

  • AIO_CHECK_EVENT invokes AIO_ILOCK and AIO_IUNLOCK so that the lock-based code cannot alter sim_asynch_queue when checking to see if there is pending I/O work to be done. sim_asynch_lock is a recursive mutex in the lock-based case, so AIO_ILOCK/AIO_IUNLOCK will succeed for the lock-based implementation.

  • New builder script flag(s) to disable AIO lock-free, force AIO lock-based code:

    • cmake-builder.ps1 -noaiointrinsics ...
    • cmake-builder.sh -no-aio-intrinsics ...
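The inlined lock-free accessors described in the bullets above might look roughly like this, assuming GCC/Clang __atomic support. The function names are illustrative, not SIMH's exact spellings; the pointer types are simplified from the real queue element type.

```c
#include <stddef.h>

/* Sketch of AIO_QUEUE_VAL: an acquire load pairs with the release side of
   the compare-exchange below, so the queue head is read consistently on
   weakly ordered CPUs (e.g. AArch64) without total store ordering. */
static inline void *aio_queue_val(void **queue)
{
    return __atomic_load_n(queue, __ATOMIC_ACQUIRE);
}

/* Sketch of AIO_QUEUE_SET: compare-exchange on the queue head.
   Returns 1 on success, 0 if the head no longer matches 'expected'
   (i.e., another thread won the race and the caller must retry). */
static inline int aio_queue_set(void **queue, void *expected, void *desired)
{
    return __atomic_compare_exchange_n(queue, &expected, desired,
                                       0 /* strong CAS */,
                                       __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}
```

A failed aio_queue_set leaves the queue untouched, which is what makes the lock-free insert loop ("read head, build node, CAS, retry on failure") safe.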

@markpizz
Copy link
Contributor

  1. You have actually provided precisely 0 cases which cause errant behavior in any simulator, so this solves a non-existent problem.

  2. You haven't actually measured the total CPU time consumed in your modified code vs the unmodified case.

[Note: I've tried this branch against the 3b2/400 SVR3 and PDP10-KL TOPS 20 v7 installs, both of which are I/O heavy and exercise multiple devices. Haven't experienced any lockups or issues.

  1. These test cases actually exercise precisely 0 asynch I/O activity.

@rcornwell
Copy link
Member

@markpizz Then please explain why when I run TOPS20 Panda under current SCP it appears to hang after a couple hours. If I disable async I/O it runs for hours. If I use Scott's patch it also runs for hours.

@markpizz
Copy link
Contributor

@rcornwell

Then please explain why when I run TOPS20 Panda under current SCP it appears to hang after a couple hours.

I'll be glad to track down and then precisely explain what's happening if you point at a configuration and disk setup and, hopefully, some minimal activities needed to reproduce the hang.

@rcornwell
Copy link
Member

Download the simulator from the sims repository.

set console wru=034
set cpu 4096K kl10b
set tua rh20
set tua1 disable
set tua2 disable
set tua3 disable
set tua4 disable
set tua5 disable
set tua6 disable
set tua7 disable
set rpa rh20
set rpb rh20
set rpb disable
set ni enable
set ni mac=12:34:56:78:9A:BC
at ni tap:tap0
set cty rsx20
att lp20 -n prt.lst
set rpa0 rp07
at rpa0 -f dbd9 RH20.RP07.1
att tty -U 2021
dep sw 254200000147
load boot.sav

Start addr=040000
sim> cont

BOOT V11.0(315)

BOOT>tops20:

[BOOT: Loading] [OK]

System structure not found. What is its name? TOPS20

[TOPS20 mounted]
Panda Distribution, PANDA TOPS-20 Monitor 7.1(21733)-4 Internet: Loading host names [OK]

System res
Date and time is: Wednesday, 27-December-2023 12:18PM
Why reload? sa
Run CHECKD? n
DDMP: Started
When system gets quiet, ^C login operator dec-20

Then, if you leave it idle for a while, it will eventually get hung waiting on a lock.

@markpizz
Contributor

Thanks for the config.

Meanwhile, where can I find:

RH20.RP07.1

and:

boot.sav

@rcornwell
Member

Download Panda distribution.

@markpizz
Contributor

Download Panda distribution.

As I'm sure you know, I'm not a user of this simulator nor do I know anything about the panda distribution. A search for "panda distribution" shows various stuff about a species of bear from China.

Specific URL pointers to the 2 relevant files will help me get to debugging and will help anyone else reviewing this issue to also reproduce the failure and be comfortable with the ultimate fix.

@bscottm
Contributor Author

bscottm commented Dec 28, 2023

  1. You have actually provided precisely 0 cases which cause errant behavior in any simulator, so this solves a non-existent problem.

Absence of evidence is not evidence of absence. TSan's warning is a good reason to re-evaluate the code, even if specific errant cases haven't been reported or just to verify that it isn't a false positive.

  2. You haven't actually measured the total CPU time consumed in your modified code vs the unmodified case.

The concern you expressed in a previous thread was a performance hit due to synchronized sim_asynch_check reads and writes. The proposed code synchronizes on sim_asynch_queue and avoids extraneous mutex locks/unlocks.

Here's a rough estimate using vax730 running the diagnostics as the baseline, which regularly trips the TSan warning. For AArch64 (Raspberry Pi 4, Debian), there's a 2.6% performance penalty to check sim_asynch_queue != QUEUE_HEAD. For Win32 x86 (a 32-bit executable), the penalty is 4.02%. The discrepancy is most likely due to GCC using __atomic_load, whereas MS uses a full compare-exchange to reference sim_asynch_queue's value.

The Win32 rough estimate is probably pessimistic because MS doesn't make it easy to measure process execution statistics (vs. the time command on Unixes.)

[Note: I've tried this branch against the 3b2/400 SVR3 and PDP10-KL TOPS 20 v7 installs, both of which are I/O heavy and exercise multiple devices. Haven't experienced any lockups or issues.

  1. These test cases actually exercise precisely 0 asynch I/O activity.

Feel free to propose better test cases (NetBSD on VAXen?). The 3b2 and pdp10-kl simulators both call AIO_CHECK_EVENT in sim_instr, which made them both look like reasonable AIO test cases.

@bscottm
Contributor Author

bscottm commented Dec 28, 2023

Download Panda distribution.

Not that hard to find on panda.trailing-edge.com.

@bscottm
Contributor Author

bscottm commented Dec 29, 2023

  2. You haven't actually measured the total CPU time consumed in your modified code vs the unmodified case.

The concern you expressed in a previous thread was a performance hit due to synchronized sim_asynch_check reads and writes. The proposed code synchronizes on sim_asynch_queue and avoids extraneous mutex locks/unlocks.

Here's a rough estimate using vax730 running the diagnostics as the baseline, which regularly trips the TSan warning. For AArch64 (Raspberry Pi 4, Debian), there's a 2.6% performance penalty to check sim_asynch_queue != QUEUE_HEAD. For Win32 x86 (a 32-bit executable), the penalty is 4.02%. The discrepancy is most likely due to GCC using __atomic_load, whereas MS uses a full compare-exchange to reference sim_asynch_queue's value.

The Win32 rough estimate is probably pessimistic because MS doesn't make it easy to measure process execution statistics (vs. the time command on Unixes.)

I'd still characterize these results as rough (3x BIN/vax730 warmup executions, 50x BIN/vax730 time executions) for AArch64, x86_64 and Win32 x86. Summary:

+---------------------------+-----------------+
| Platform                  | Result          |
+---------------------------+-----------------+
| Aarch64 Debian on RPi 4   | 2.7% slowdown   |
| x86_64 (WSL, Ubuntu)      | 3.5% speedup    |
| x86 (Win32)               | 4.0% slowdown   |
+---------------------------+-----------------+

timings.ods spreadsheet with collected timings (LibreOffice Calc or Excel).

@bscottm
Contributor Author

bscottm commented Dec 29, 2023

Refined the x86 Win32 results by wrangling C# and Powershell to get user and system times vs. total elapsed as reported by Measure-Command.

+---------------------------+-----------------+
| Platform                  | Result          |
+---------------------------+-----------------+
| Aarch64 Debian on RPi 4   | 2.7% slowdown   |
| x86_64 (WSL, Ubuntu)      | 3.5% speedup    |
| x86 (Win32)               | 6.7% slowdown   |
+---------------------------+-----------------+

timings.ods spreadsheet with collected timings (LibreOffice Calc or Excel).

@markpizz
Contributor

Not that hard to find on panda.trailing-edge.com.

Thanks. I've got it. Testing now.

So far:

  • on Ubuntu 22.04 it seems to run for hours after logging in, and I haven't seen any case where I don't get a new @ prompt when I hit return. This is with asynch enabled.

  • on Windows it runs for a bit, but never gets to the point where ^C will prompt for login. This is what I see:

KL-10 simulator Open SIMH V4.1-0 Current        git commit id: 81d96274+uncommitted-changes
D:/Panda/PDP10-KL.ini-2> set noasync
Asynchronous I/O disabled
D:/Panda/PDP10-KL.ini-3> set console wru=034
D:/Panda/PDP10-KL.ini-4> set cpu 4096K kl10b
D:/Panda/PDP10-KL.ini-5> set tua rh20
D:/Panda/PDP10-KL.ini-6> set tua1 disable
D:/Panda/PDP10-KL.ini-7> set tua2 disable
D:/Panda/PDP10-KL.ini-8> set tua3 disable
D:/Panda/PDP10-KL.ini-9> set tua4 disable
D:/Panda/PDP10-KL.ini-10> set tua5 disable
D:/Panda/PDP10-KL.ini-11> set tua6 disable
D:/Panda/PDP10-KL.ini-12> set tua7 disable
D:/Panda/PDP10-KL.ini-13> set rpa rh20
D:/Panda/PDP10-KL.ini-14> set rpb rh20
D:/Panda/PDP10-KL.ini-15> set rpb disable
D:/Panda/PDP10-KL.ini-16> set ni enable
D:/Panda/PDP10-KL.ini-16> #set ni debug
D:/Panda/PDP10-KL.ini-17> #set debug -f stdout
D:/Panda/PDP10-KL.ini-19> set ni mac=12:34:56:78:9A:BC
D:/Panda/PDP10-KL.ini-20> att ni eth9
%SIM-INFO: Eth: opened OS device \Device\NPF_{2ACA984D-98DE-4784-8FFC-C3245CBD3A38} - vEthernet (LAN Virtual Switch)
D:/Panda/PDP10-KL.ini-20> #att ni tap:tap0
D:/Panda/PDP10-KL.ini-21> #att ni eth4
D:/Panda/PDP10-KL.ini-23> set cty rsx20
D:/Panda/PDP10-KL.ini-24> att lp20 -n prt.lst
%SIM-INFO: LP20: creating new file: prt.lst
D:/Panda/PDP10-KL.ini-25> set rpa0 rp07
D:/Panda/PDP10-KL.ini-26> copy RH20.RP07.1.original RH20.RP07.1
%SIM-INFO:         1 file(s) copied
D:/Panda/PDP10-KL.ini-27> att rpa0 -f dbd9 RH20.RP07.1
D:/Panda/PDP10-KL.ini-28> att tty -U 2021
%SIM-INFO: Listening on port 2021
D:/Panda/PDP10-KL.ini-29> dep sw 254200000147
D:/Panda/PDP10-KL.ini-30> load boot.sav
Start addr=040000
D:/Panda/PDP10-KL.ini-31> expect "BOOT>" send "tops20:\r"; CONTINUE
D:/Panda/PDP10-KL.ini-32> expect "System structure not found.  What is its name? " send "TOPS20\r"; CONTINUE
D:/Panda/PDP10-KL.ini-33> expect "Why reload? " send "sa\r"; CONTINUE
D:/Panda/PDP10-KL.ini-34> expect "Run CHECKD? " send "n\r"; CONTINUE
D:/Panda/PDP10-KL.ini-35> continue

BOOT V11.0(315)

BOOT>
Expect matched, PC: 773454 (JRST 0,773453)
D:/Panda/PDP10-KL.ini-35> send "tops20:\r"
D:/Panda/PDP10-KL.ini-35> CONTINUE
tops20:

[BOOT: Loading] [OK]

 System structure not found.  What is its name?
Expect matched, PC: 010134 (SKIPLE 0,6(2))
D:/Panda/PDP10-KL.ini-35> send "TOPS20\r"
D:/Panda/PDP10-KL.ini-35> CONTINUE
TOPS20

[TOPS20 mounted]
Panda Distribution, PANDA TOPS-20 Monitor 7.1(21733)-4 Internet: Loading host names [OK]

System res
Date and time is: Friday, 29-December-2023 6:19AM
Why reload?
Expect matched, PC: 134306 (IORI 2,100)
D:/Panda/PDP10-KL.ini-35> send "sa\r"
D:/Panda/PDP10-KL.ini-35> CONTINUE
sa
Run CHECKD?
Expect matched, PC: 134306 (IORI 2,100)
D:/Panda/PDP10-KL.ini-35> send "n\r"
D:/Panda/PDP10-KL.ini-35> CONTINUE
n
 DDMP: Started
[KNILDR: Loading microcode version 1(172) into Ethernet channel 0]


SYSJOB 7A(88)-4 started at 29-Dec-2023 0619
SJ  0: @LOGIN OPERATOR
SJ  0: @ENABLE
SJ  0: $SYSTEM:STSJ1
29-Dec-2023 06:19:09 SYSJB1: SYSJB1 started.
SJ  0: $^ESET LOGIN ANY
SJ  0: $OPR

 [NCP]:              Waiting for ORION to start
29-Dec-2023 06:19:09 SYSJB1: Job 0:
29-Dec-2023 06:19:09 SYSJB1: Job 0:  Panda Distribution, PANDA TOPS-20 Monitor 7.1(21733)-4
29-Dec-2023 06:19:09 SYSJB1: Job 1:
29-Dec-2023 06:19:09 SYSJB1: Job 1:  Panda Distribution, PANDA TOPS-20 Monitor
% [Logger 29-Dec-2023 06:19:10 ]: Started at 29-Dec-2023 06:19:09
7.1(21733)-4
29-Dec-2023 06:19:10 SYSJB1: Job 2:
29-Dec-2023 06:19:10 SYSJB1: Job 2:  Panda Distribution, PANDA TOPS-20 Monitor 7.1(21733)-4
29-Dec-2023 06:19:10 SYSJB1: Job 3:
29-Dec-2023 06:19:10 SYSJB1: Job 3:  Panda Distribution, PANDA TOPS-20 Monitor 7.1(21733)-4
29-Dec-2023 06:19:10 SYSJB1: Job 4:
29-Dec-2023 06:19:10 SYSJB1: Job 4:  Panda Distribution, PANDA TOPS-20 Monitor 7.1(21733)-4
29-Dec-2023 06:19:10 SYSJB1: Job 5:
29-Dec-2023 06:19:11 SYSJB1: Job 5:  Panda Distribution, PANDA TOPS-20 Monitor 7.1(21733)-4
29-Dec-2023 06:19:11 SYSJB1: Job 1: @LOGIN OPERATOR
29-Dec-2023 06:19:11 SYSJB1: Job 0: @LOGIN OPERATOR
29-Dec-2023 06:19:11 SYSJB1: Job 2: @LOGIN OPERATOR
29-Dec-2023 06:19:11 SYSJB1: Job 2: @ENABLE
29-Dec-2023 06:19:11 SYSJB1: Job 3: @LOGIN OPERATOR
29-Dec-2023 06:19:11 SYSJB1: Job 4: @LOGIN OPERATOR
29-Dec-2023 06:19:11 SYSJB1: Job 0: @ENABLE
29-Dec-2023 06:19:11 SYSJB1: Job 1: @ENABLE
29-Dec-2023 06:19:11 SYSJB1: Job 1: $NETSRV
29-Dec-2023 06:19:11 SYSJB1: Job 2: $SMTJFN
29-Dec-2023 06:19:11 SYSJB1: Job 3: @ENABLE
29-Dec-2023 06:19:11 SYSJB1: Job 4: @ENABLE
29-Dec-2023 06:19:11 SYSJB1: Job 5: @LOGIN
SJ  0:
SJ  0: 06:19:09          -- Can't rename file SPOOL:OPERATOR-SYSTEM.LOG --
SJ  0:                 Current LOG file will be appended
SJ  0: OPR>TAKE SYSTEM:SYSTEM.CMD
SJ  0:
SJ  0: 06:19:12        --ORION OPERATOR
29-Dec-2023 06:19:12 SYSJB1: Job 5: @ENABLE
29-Dec-2023 06:19:12 SYSJB1: Job 0: $RESOLV
29-Dec-2023 06:19:12 SYSJB1: Job 3: $MMAILR
29-Dec-2023 06:19:12 SYSJB1: Job 4: $IMAPSV
29-Dec-2023 06:19:12 SYSJB1: Job 5: $FTS
logging disabled by job 1 OPERATOR at terminal 13--
SJ  0:
SJ  0: 06:19:12        --Output display for OPR modified--
SJ  0:
SJ  0: 06:19:13        --Output display for OPR modified--
SJ  0:
SJ  0: 06:19:13        --Output display for OPR mo
29-Dec-2023 06:19:13 SYSJB1: Job 5: FTS>TAKE FTS.CMD
29-Dec-2023 06:19:13 SYSJB1: Job 5: [FTS20: FTS event 38: spooler started]
dified--
SJ  0:
SJ  0: 06:19:13        --Output display for OPR modified--
SJ  0:
SJ  0: 06:19:12        Batch-Stream 0  -- Set Accepted --
SJ  0:
SJ  0: 06:19:12        Batch-Stream 1  -- Set Accepted --
SJ  0:
SJ  0: 06:19:12        Batch-Stream 2  -- Set Accepted --
SJ  0:
SJ  0: 06:19:12        Batch-Stream 3  -- Set Accepted --
SJ  0:
SJ  0: 06:19:13        Batch-Stream 0  -- Set Accepted --
SJ  0:
SJ  0: 06:19:13        Batch-Stream 1  -- Set Accepted --
SJ  0:
SJ  0: 06:19:13        Batch-Stream 2  -- Set Accepted --
SJ  0:
SJ  0: 06:19:13        Batch-Stream 3  -- Set Accepted --
SJ  0:
SJ  0: 06:19:13        Batch-Stream 0  -- Startup Scheduled --
SJ  0:
SJ  0: 06:19:13        Batch-Stream 1  -- Startup Scheduled --
SJ  0:
SJ  0: 06:19:13        Batch-Stream 2  -- Startup Scheduled --
SJ  0:
SJ  0: 06:19:13        Batch-Stream 3  -- Startup Scheduled --

06:19:15 From operator terminal 13 on noSJ  0: OPR>
SJ  0: 06:19:15        --SEND command completed--
SJ  0: OPR>
SJ  0: 06:19:19          -- Device status file initialization error --
SJ  0:                 Changed CPU
SJ  0:                 SYSTEM:DEVICE-STATUS.BIN file will be reset
SJ  0:
SJ  0: 06:19:19          -- Structure Status Change Detected --
SJ  0:                 Previously mounted structure TOPS20: detected
SJ  0:
SJ  0: 06:19:19          -- Structure Status Change Detected --
SJ  0:                 Structure state for structure TOPS20 is incorrect
SJ  0:                   EXCLUSIVE/SHARED attribute set incorrectly
SJ  0:                 Status of structure TOPS20: is set:
SJ  0:                 Domestic, Unregulated, Shared, Available, Dumpable
SJ  0:

On Windows I tried attaching the NI device to the local system LAN and to another interface without local traffic. No change in result. Again, same results whether SET ASYNC or SET NOASYNC. The above Windows case is with Scott's changes to scp.c and sim_defs.h applied to the code I tested with. I got the same behavior before applying his change, and again with or without asynch enabled.

I don't think the Windows hang has anything to do with the point raised in this PR. Building that simulator on Windows emits these warnings:

1>kl10_dn.c
1>..\PDP10\kl10_dn.c(406) : warning C4102: 'enter_pri' : unreferenced label
1>..\PDP10\kl10_dn.c(368) : warning C4101: 'ch' : unreferenced local variable
1>..\PDP10\kl10_dn.c(656) : warning C4101: 'ch' : unreferenced local variable
1>..\PDP10\kl10_dn.c(717) : warning C4101: 'otty' : unreferenced local variable
1>..\PDP10\kl10_dn.c(908) : warning C4101: 'data1' : unreferenced local variable
1>..\PDP10\kl10_dn.c(910) : warning C4101: 'n' : unreferenced local variable
1>..\PDP10\kl10_dn.c(912) : warning C4101: 'save_ptr' : unreferenced local variable
1>..\PDP10\kl10_dn.c(911) : warning C4101: 'ln' : unreferenced local variable
1>..\PDP10\kl10_dn.c(909) : warning C4101: 'dataq' : unreferenced local variable
1>..\PDP10\kl10_dn.c(913) : warning C4101: 'ch' : unreferenced local variable
1>..\PDP10\kl10_dn.c(1035) : warning C4101: 'ch' : unreferenced local variable

@rcornwell
Member

It is very hard to reproduce.... it gets stuck waiting for the AIO lock.

@markpizz
Contributor

What about the Windows case? With or without async enabled, and with or without Scott's patch?

@rcornwell
Member

I don't do a lot of testing on Windows. I tried it with async with Scott's patch and did not see the error. I was also experiencing similar issues with the B5500 emulator. They went away with Scott's patch.

@markpizz
Contributor

@rcornwell :

I don't do a lot of testing on Windows. I tried it with async with Scott's patch and did not see the error.

but you also say:

It is very hard to reproduce.... it gets stuck waiting for the AIO lock.

Given that, as you say, "It is very hard to reproduce", and my inability to reproduce it, maybe your test didn't create the hang... The Ubuntu test has been running without issue for coming up on a day... The Windows test can't even reproduce the working case, with or without Scott's patches and with or without asynch being enabled.

So I'm still looking for a case where this PR fixes an observable problem.

I was also experiencing similar issues with the B5500 emulator.

Scott's patch had absolutely no chance of changing anything in the B5500. That simulator doesn't use sim_disk, sim_tape or sim_ether which are the only cases that could potentially use asynch I/O. The B5500 doesn't even have any asynch code (which his changes are part of) compiled in.

What platform (OS and architecture) were you experiencing the problems on? What compiler was used to build the failing cases?

@bscottm :

Feel free to propose better test cases (NetBSD on VAXen?) 3b2 and pdp10-kl both call AIO_CHECK_EVENT in sim_instr, which made them both look like reasonable AIO test cases.

They do call AIO_CHECK_EVENT due to potential activities in their simulated ethernet devices. When tested with no ethernet traffic, the tests provide very little or no functional test of your changes.

[Note: I've tried this branch against the 3b2/400 SVR3 and PDP10-KL TOPS 20 v7 installs, both of which are I/O heavy and exercise multiple devices. Haven't experienced any lockups or issues.

Doing OS installs to test this doesn't involve any activity on the only devices in these simulators that can potentially do Asynch I/O (their network devices are the only cases in these simulators).

Fix a Clang/LLVM sanitizer warning that scp_asynch_check is written by both the AIO and main threads, one thread potentially clobbering the other's value. The scp_asynch_check = 0 at scp.c:239 is where the thread sanitizer detects the issue.

scp_asynch_check = 0 is supposed to cause the main thread to process I/O updates on the next AIO_CHECK_EVENT code's execution. To preserve that behavior, AIO_CHECK_EVENT now executes AIO_UPDATE_QUEUE when either sim_asynch_check decrements below 0 or there is work to be done on the AIO queue (sim_asynch_queue != QUEUE_HEAD.)

The setting of scp_asynch_check = 0 provides a signal that says "it would be nice if your asynch queue check happened on the next instruction execution". If it doesn't happen, no big deal since it will end up happening within a few instructions anyway. A several percent simulation slowdown for ALL workloads just to achieve this seems to me to be unjustified. That several percent performance gain was a deliberate design decision when this functionality was added. I had originally just believed there would be an advantage with this approach. Thanks for measuring the gain.

Feel free to propose better test cases (NetBSD on VAXen?) 3b2 and pdp10-kl both call AIO_CHECK_EVENT in sim_instr, which made them both look like reasonable AIO test cases.

They do call AIO_CHECK_EVENT due to potential activities in their simulated ethernet devices. When tested with no ethernet traffic, the tests provide very little or no functional test of your changes.

A rigorous test would involve activity that deals with much network AND much disk activity, where the disk activity uses the pdp11_rq device which is the only disk that does asynch disk I/O.

@rcornwell
Member

Perhaps try telnetting into TOPS20 or telnetting out of it, then waiting for it to get hung. All I know is that I was seeing several of my simulators getting stuck in waits for the AIO lock. I applied his patch and the waits disappeared. I also don't see any performance hit from his patches. So I am OK with them.

@bscottm
Contributor Author

bscottm commented Dec 30, 2023

Perhaps try telnetting into TOPS20 or telnetting out of it, then waiting for it to get hung. All I know is that I was seeing several of my simulators getting stuck in waits for the AIO lock. I applied his patch and the waits disappeared. I also don't see any performance hit from his patches. So I am OK with them.

@markpizz: Richard is not wrong, and I think this PR fixes a long-standing issue with the IBM 1130 and SEL simulators. These two simulators have spurious failures on the CI/CD builds. IIRC, you implemented an instruction limit and timeout so that the GH CI/CD wouldn't hang indefinitely when these simulators' tests went astray.

I experimented with InterlockedExchangePointer(&sim_asynch_queue, sim_asynch_queue) to see if that could be used as an atomic load on Win32. According to the docs, this should return the previous value of sim_asynch_queue while storing sim_asynch_queue's value back onto itself. However, due to instruction reordering, the volatile read happens well before the actual exchange. In the meantime, sim_asynch_queue's value changed before the exchange (verified with printf()s) and the exchange ended up clobbering the current sim_asynch_queue value. Consequently, ibm1130, sel32 and all of the VAX simulators failed their diagnostic tests -- this indirectly supports Richard's observation that something goes awry acquiring sim_asynch_lock and the simulator ends up deadlocked.

Bottom line: I haven't been able to get the IBM 1130 or SEL32 simulators to hang indefinitely using this patch, whereas the hangs do happen from time to time with the existing AIO code. (The punchline for the Win32 code is: use compare-exchange, there is no atomic load.)
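That punchline — emulating an atomic pointer load with a compare-exchange rather than a plain exchange — can be sketched as follows. The helper name aio_queue_load is hypothetical; the _WIN32 branch relies on the documented InterlockedCompareExchangePointer semantics, with a GCC/Clang __atomic fallback so the sketch is portable.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>

static void *aio_queue_load(void *volatile *queue)
{
    /* CAS with expected == desired == NULL: if the head is NULL, the
       "store" writes NULL over NULL (a no-op); if it isn't, the CAS
       fails and simply returns the current head.  Either way we get an
       atomic snapshot WITHOUT clobbering the queue, unlike a plain
       InterlockedExchangePointer() misused as a load. */
    return InterlockedCompareExchangePointer(queue, NULL, NULL);
}
#else
static void *aio_queue_load(void *volatile *queue)
{
    /* GCC/Clang have a genuine atomic load, so no CAS trick is needed. */
    return __atomic_load_n(queue, __ATOMIC_ACQUIRE);
}
#endif
```

The key property is that the load never writes a new value over a concurrently updated head, which is exactly the failure mode the exchange-based "load" exhibited.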

@markpizz
Contributor

@bscottm

Bottom line: I haven't been able to get the IBM 1130 or SEL32 simulators to hang indefinitely using this patch, whereas the hangs do happen from time to time with the existing AIO code.

That you can't get these to fail is proof of exactly what? I never said that my inability to see failure of @rcornwell's pdp10-kl was proof of anything. If he sees failures, I want to see one so I can precisely know what is actually happening.

Meanwhile, as to your "proof" ... If you look carefully the code you are changing in your most recent update can't possibly have any effect on the IBM 1130 since the change you're making (or the original code) is only visible to a simulator when the simulator is compiled with SIM_ASYNCH_IO defined. The makefile and the Visual Studio project builds DO NOT define this when building the IBM1130. Maybe you are defining it in your cmake builds... @rcornwell's sims repo only has makefile and visual studio project build support.

As for the SEL32 build: That simulator is built with SIM_ASYNCH_IO defined (as it should be, since it, in theory, leverages sim_ether's asynch support). Meanwhile, the build, even when running the author's tests, does not exercise any network activity, and as such wouldn't be traversing the code you're messing with. Likewise for the VAX running its instruction diagnostics.

To quote someone earlier in this discussion: "Absence of evidence is not evidence of absence."

(The punchline for the Win32 code is: use compare-exchange, there is no atomic load.)

This may indeed be appropriate.

The comments in your most recent change have language that says: "Do Not Ever use InterlockedExchangePointer() to read sim_asynch_queue. This results in failures across the IBM1130, SEL32 and VAXen simulators." It is truly amazing that I have had a VAX simulator running continuously (whenever my Windows Server is up) for at least the last 12 years and it has never had any sort of a failure like this. This system does tons of network and disk I/O and doesn't merely run instruction diagnostics.

It would be best if you created a specific test demonstrating interactions with scp, sim_instr() and devices that perform asynch I/O that proves (or disproves) your assertion and that can readily demonstrate the problem you're fixing on one or more supported platforms. Without hard proof of failure and some change that unambiguously fixes it, making changes like this is at best wishful thinking. Under real simulator conditions, devices insert events into the queue on the order of dozens of times a second, while sim_instr() examines the queue likely tens to hundreds of thousands of times per second while executing tens of millions of instructions per second. The usual case would involve at most 2 or 3 devices slightly busy at the same time, with exceedingly rare concurrent queue insertions.

IIRC, you implemented an instruction limit and timeout so that the GH CI/CD wouldn't hang indefinitely when these simulators' tests went astray.

I did implement the optional CPU instruction/time limit, but the failures that were happening all related to errant test behaviors or less than perfect test completion checks.

@bscottm
Contributor Author

bscottm commented Dec 31, 2023

Bottom line: I haven't been able to get the IBM 1130 or SEL32 simulators to hang indefinitely using this patch, whereas the hangs do happen from time to time with the existing AIO code.

That you can't get these to fail is proof of exactly what? I never said that my inability to see failure of @rcornwell's pdp10-kl was proof of anything. If he sees failures, I want to see one so I can precisely know what is actually happening.

I can, however, cause SEL32 and IBM 1130 to fail repeatedly with a slight tweak, lending credence to the idea that the existing AIO code has an issue that needs further investigation, because the failure is in the AIO code. Understanding how the existing code ends up in a deadlock might prove fruitful; having a working patch that solves several simulators' issues is the proverbial "bird in hand."

Meanwhile, as to your "proof" ... If you look carefully the code you are changing in your most recent update can't possibly have any effect on the IBM 1130 since the change you're making (or the original code) is only visible to a simulator when the simulator is compiled with SIM_ASYNCH_IO defined. The makefile and the Visual Studio project builds DO NOT define this when building the IBM1130. Maybe you are defining it in your cmake builds... @rcornwell's sims repo only has makefile and visual studio project build support.

The CMake build defines SIM_ASYNCH_IO for all simulators if thread support is detected. So, it's visible.

As for the SEL32 build: That simulator is built with SIM_ASYNC_IO defined (as it should be since it, in theory, leverages sim_ether's asynch support). Meanwhile, the build, even when running the author's tests, does not exercise any network activities, and as such wouldn't be traversing the code you're messing with. Likewise for the vax running its instruction diagnostics.

And yet, when running the tests, it ends up waiting for some AIO request to complete because sim_asynch_queue != QUEUE_HEAD; by unintentionally not allowing that AIO request to complete, I can cause the simulator to consistently exceed the instruction count, as does the IBM 1130.

To quote someone earlier in this discussion: "Absence of evidence is not evidence of absence."

To the contrary, I think I found evidence. :-)

The comments in your most recent change have language that says: "Do Not Ever use InterlockedExchangePointer() to read sim_asynch_queue. This results in failures across the IBM1130, SEL32 and VAXen simulators."

Maybe I should push a commit that ensures that InterlockedExchangePointer is used instead of InterlockedCompareExchangePointer just so you can see what happens. I have significant misgivings about pushing code that intentionally demonstrates breakage, but if that's the only way to convince you...

@markpizz
Contributor

markpizz commented Jan 3, 2024

I'm digging into some of what you've explored.

Meanwhile, these relevant facts should influence your thinking:

1 - You are always building simulators with SIM_ASYNCH_IO defined. The makefile builds VERY SPECIFICALLY compile with SIM_ASYNCH_IO defined only when the specific simulator has devices which leverage asynchronous I/O. The reason is the performance impact of the thread synchronization overhead when it absolutely isn't needed by these simulators. My measurement of the cost of this overhead is between 300% and 400% of the time spent in the various simh event management routines (sim_activate*, sim_cancel, etc.). You should either adjust the cmake builds for these simulators to avoid building with SIM_ASYNCH_IO defined, or folks should be advised to build with the makefile to get the fastest simulator results.

2 - Your conclusion that whatever problem the IBM1130 simulator has in the CI/CD testing is related to the changes in this PR is wrong. This simulator does ABSOLUTELY no asynchronous I/O and is completely single threaded. When built with the makefile (without SIM_ASYNCH_IO) in the earlier makefile-based CI/CD, the test runs intermittently failed there, AND NONE of the concerns you raise about InterlockedCompareExchangePointer are a factor since it is never used. If the PR code seems like it "happens" to have fixed the problem, you have just concealed it, or the intermittent activity merely hasn't occurred yet.

@rcornwell
The SET NOASYNC doesn't change the behavior of the PDP10-KL simulator in any way. It merely disables the use of asynch I/O for devices which use sim_disk and sim_tape leveraging asynch I/O, while your simulators don't do this. It was an oversight that SET NOASYNC doesn't disable asynch I/O for devices which use sim_ether. I'm considering a PR to have SET NOASYNC also affect sim_ether. Devices using sim_ether don't have to use asynch I/O, it just provides a way to avoid polling for incoming packets and thus possibly be better behaved when idling.

@bscottm
Contributor Author

bscottm commented Jan 3, 2024

I'm digging into some of what you've explored.

My measurements of the cost of this overhead is between 300% and 400% of the time spent in the various simh event management routines (sim_activate*, sim_cancel, etc.).

Evidence?

2 - Your conclusion that whatever problem the IBM1130 simulator has in the CI/CD testing is related to the changes in this PR is wrong.

Or it's a Heisenbug related to grabbing sim_asynch_lock. Still investigating that possibility.

@rcornwell
Member

@markpizz I am not going to argue with you over this. I was having issues with Panda on pdp10-kl; I set noasync and the issue went away. @bscottm asked me (and others) to test his patch. I tried it and my issue with pdp10-kl went away. I do know that under gdb I was able to see that the simulator was stuck waiting for one of the AIO locks.

PDP10-kl does use async traffic for networking.

@markpizz
Contributor

markpizz commented Jan 3, 2024

Evidence?

If you just use your eyes and look at the additional lines of code executed in sim_activate*, sim_cancel, and sim_aio_check_event, and compare them to the simple code paths without any asynch stuff, my seat-of-the-pants estimate would have been more than the 3-4X I measured.

I've been busy but am still digging into this.

Or it's a Heisenbug related to grabbing sim_asynch_lock. Still investigating that possibility.

Well, "Heisenbug" isn't in the picture, since there is no sim_asynch_lock when compiled without SIM_ASYNCH_IO. Built without asynch support, the IBM1130 intermittently failed the original CI/CD activities.

@markpizz
Contributor

markpizz commented Jan 3, 2024

If you just use your eyes and look at the additional lines of code executed in the sim_activate*, sim_cancel, and sim_aio_check_event and compare them to the simple code paths without any asynch stuff, my seat of the pants would have been more than the 3-4X I measured.

Again, this is specifically for the case where simulators DO NOT want or need to be built with SIM_ASYNCH_IO defined, since everything they do is done completely within one thread!

@pkoning2
Member

pkoning2 commented Jan 3, 2024

I'm starting to wonder why SIM_ASYNCH_IO should be around in the first place.

@rcornwell
Member

@pkoning2 I agree here... the only places where I can see an I/O callback are for ethernet or tmxr data receiving. Generally, at the level of simh simulators, asynchronous I/O does not buy much. Also, most (other than perhaps the VAX) can't really take advantage of async disk I/O.

@bscottm
Contributor Author

bscottm commented Jan 4, 2024

Evidence?

If you just use your eyes and look at the additional lines of code executed in the sim_activate*, sim_cancel, and sim_aio_check_event and compare them to the simple code paths without any asynch stuff, my seat of the pants would have been more than the 3-4X I measured.

Here are the abbreviated timings from ctest across the master and scp-async-check branches for the simulators that run diagnostics in the GH CI/CD pipeline. It's about the best one can do with respect to comparing the branches. There is no appreciable difference in runtimes. If you have actual evidence that there's a 300-400% penalty between AIO implementations, I'd really like to see the actual test evidence and the testing methodology.

While AIO is infrequently used in most simulators, what counts is the overhead from checking a synchronized variable in a simulator's sim_instr loop via AIO_CHECK_EVENT. The overhead penalty for properly written lock-free code that preserves the original code's intent is minimal.

With respect to defining SIM_ASYNCH_IO, the makefile globally defines that on the command line for ALL simulators in OS_CCDEFS when it determines thread support exists. SIM_ASYNCH_IO is not per-simulator.

master, Ubuntu 20.04
2023-12-18T20:28:04.1836262Z       Start 14: simh-i650
2023-12-18T20:28:05.9387735Z 14/78 Test #14: simh-i650 ........................   Passed    1.75 sec
2023-12-18T20:28:26.2290764Z       Start 43: simh-pdp9
2023-12-18T20:28:27.3342295Z 43/78 Test #43: simh-pdp9 ........................   Passed    1.10 sec
2023-12-18T20:28:27.3343212Z       Start 44: simh-pdp8
2023-12-18T20:28:35.8361705Z 44/78 Test #44: simh-pdp8 ........................   Passed    8.50 sec
2023-12-18T20:28:38.7437347Z       Start 49: simh-sel32
2023-12-18T20:29:05.2704707Z 49/78 Test #49: simh-sel32 .......................   Passed   26.53 sec
2023-12-18T20:29:16.7551876Z       Start 63: simh-vax730
2023-12-18T20:29:29.2744983Z 63/78 Test #63: simh-vax730 ......................   Passed   12.52 sec
2023-12-18T20:29:29.2746097Z       Start 64: simh-vax750
2023-12-18T20:29:40.9108316Z 64/78 Test #64: simh-vax750 ......................   Passed   11.64 sec
2023-12-18T20:29:40.9109244Z       Start 65: simh-vax780
2023-12-18T20:29:53.3705976Z 65/78 Test #65: simh-vax780 ......................   Passed   12.46 sec
2023-12-18T20:29:53.3707310Z       Start 66: simh-vax8200
2023-12-18T20:30:15.5499331Z 66/78 Test #66: simh-vax8200 .....................   Passed   22.18 sec
2023-12-18T20:30:15.5500266Z       Start 67: simh-vax8600
2023-12-18T20:30:30.7624850Z 67/78 Test #67: simh-vax8600 .....................   Passed   15.21 sec

scp-async-check, Ubuntu 20.04
2023-12-30T22:50:32.8525317Z       Start 14: simh-i650
2023-12-30T22:50:34.6236395Z 14/78 Test #14: simh-i650 ........................   Passed    1.77 sec
2023-12-30T22:50:54.9277926Z       Start 43: simh-pdp9
2023-12-30T22:50:56.0612078Z 43/78 Test #43: simh-pdp9 ........................   Passed    1.13 sec
2023-12-30T22:50:56.0614294Z       Start 44: simh-pdp8
2023-12-30T22:51:04.4765588Z 44/78 Test #44: simh-pdp8 ........................   Passed    8.42 sec
2023-12-30T22:51:07.3414890Z       Start 49: simh-sel32
2023-12-30T22:51:33.8405408Z 49/78 Test #49: simh-sel32 .......................   Passed   26.50 sec
2023-12-30T22:51:45.4362515Z       Start 63: simh-vax730
2023-12-30T22:51:58.0541034Z 63/78 Test #63: simh-vax730 ......................   Passed   12.62 sec
2023-12-30T22:51:58.0542485Z       Start 64: simh-vax750
2023-12-30T22:52:09.6462430Z 64/78 Test #64: simh-vax750 ......................   Passed   11.59 sec
2023-12-30T22:52:09.6463524Z       Start 65: simh-vax780
2023-12-30T22:52:21.9231696Z 65/78 Test #65: simh-vax780 ......................   Passed   12.28 sec
2023-12-30T22:52:21.9232876Z       Start 66: simh-vax8200
2023-12-30T22:52:43.8002800Z 66/78 Test #66: simh-vax8200 .....................   Passed   21.88 sec
2023-12-30T22:52:43.8003874Z       Start 67: simh-vax8600
2023-12-30T22:52:59.0384231Z 67/78 Test #67: simh-vax8600 .....................   Passed   15.24 sec

master, Ubuntu 22.04
2023-12-18T20:27:02.3618749Z       Start 14: simh-i650
2023-12-18T20:27:04.1658463Z 14/78 Test #14: simh-i650 ........................   Passed    1.80 sec
2023-12-18T20:27:24.9865117Z       Start 43: simh-pdp9
2023-12-18T20:27:26.1299870Z 43/78 Test #43: simh-pdp9 ........................   Passed    1.14 sec
2023-12-18T20:27:26.1301400Z       Start 44: simh-pdp8
2023-12-18T20:27:34.3729348Z 44/78 Test #44: simh-pdp8 ........................   Passed    8.24 sec
2023-12-18T20:27:37.3409359Z       Start 49: simh-sel32
2023-12-18T20:28:04.7105242Z 49/78 Test #49: simh-sel32 .......................   Passed   27.37 sec
2023-12-18T20:28:16.5344480Z       Start 63: simh-vax730
2023-12-18T20:28:28.9495993Z 63/78 Test #63: simh-vax730 ......................   Passed   12.41 sec
2023-12-18T20:28:28.9497402Z       Start 64: simh-vax750
2023-12-18T20:28:40.9250097Z 64/78 Test #64: simh-vax750 ......................   Passed   11.97 sec
2023-12-18T20:28:40.9251711Z       Start 65: simh-vax780
2023-12-18T20:28:53.4627477Z 65/78 Test #65: simh-vax780 ......................   Passed   12.54 sec
2023-12-18T20:28:53.4628840Z       Start 66: simh-vax8200
2023-12-18T20:29:15.9236492Z 66/78 Test #66: simh-vax8200 .....................   Passed   22.46 sec
2023-12-18T20:29:15.9237424Z       Start 67: simh-vax8600
2023-12-18T20:29:31.1565508Z 67/78 Test #67: simh-vax8600 .....................   Passed   15.23 sec

scp-async-check, Ubuntu 22.04
2023-12-30T22:49:26.1687755Z       Start 14: simh-i650
2023-12-30T22:49:27.9519990Z 14/78 Test #14: simh-i650 ........................   Passed    1.78 sec
2023-12-30T22:49:48.7653625Z       Start 43: simh-pdp9
2023-12-30T22:49:49.8889200Z 43/78 Test #43: simh-pdp9 ........................   Passed    1.12 sec
2023-12-30T22:49:49.8890144Z       Start 44: simh-pdp8
2023-12-30T22:49:58.1304033Z 44/78 Test #44: simh-pdp8 ........................   Passed    8.24 sec
2023-12-30T22:50:01.0948594Z       Start 49: simh-sel32
2023-12-30T22:50:29.1131311Z 49/78 Test #49: simh-sel32 .......................   Passed   28.02 sec
2023-12-30T22:50:40.9503989Z       Start 63: simh-vax730
2023-12-30T22:50:53.4186019Z 63/78 Test #63: simh-vax730 ......................   Passed   12.47 sec
2023-12-30T22:50:53.4187566Z       Start 64: simh-vax750
2023-12-30T22:51:04.9010317Z 64/78 Test #64: simh-vax750 ......................   Passed   11.48 sec
2023-12-30T22:51:04.9012249Z       Start 65: simh-vax780
2023-12-30T22:51:17.3799087Z 65/78 Test #65: simh-vax780 ......................   Passed   12.48 sec
2023-12-30T22:51:17.3800685Z       Start 66: simh-vax8200
2023-12-30T22:51:39.8709947Z 66/78 Test #66: simh-vax8200 .....................   Passed   22.49 sec
2023-12-30T22:51:39.8711221Z       Start 67: simh-vax8600
2023-12-30T22:51:54.7841455Z 67/78 Test #67: simh-vax8600 .....................   Passed   14.91 sec

master, macos-12
2023-12-18T20:38:43.7856970Z       Start 14: simh-i650
2023-12-18T20:38:53.1833780Z 14/78 Test #14: simh-i650 ........................   Passed    9.40 sec
2023-12-18T20:39:32.3535090Z       Start 43: simh-pdp9
2023-12-18T20:39:39.5764550Z 43/78 Test #43: simh-pdp9 ........................   Passed    7.22 sec
2023-12-18T20:39:39.5767520Z       Start 44: simh-pdp8
2023-12-18T20:40:07.2912340Z 44/78 Test #44: simh-pdp8 ........................   Passed   27.69 sec
2023-12-18T20:40:13.3668600Z       Start 49: simh-sel32
2023-12-18T20:41:03.2762050Z 49/78 Test #49: simh-sel32 .......................   Passed   49.88 sec
2023-12-18T20:42:35.7571800Z       Start 63: simh-vax730
2023-12-18T20:43:02.9621850Z 63/78 Test #63: simh-vax730 ......................   Passed   27.20 sec
2023-12-18T20:43:02.9632610Z       Start 64: simh-vax750
2023-12-18T20:43:29.2195630Z 64/78 Test #64: simh-vax750 ......................   Passed   26.26 sec
2023-12-18T20:43:29.2205800Z       Start 65: simh-vax780
2023-12-18T20:44:03.3664110Z 65/78 Test #65: simh-vax780 ......................   Passed   34.15 sec
2023-12-18T20:44:03.3681640Z       Start 66: simh-vax8200
2023-12-18T20:44:49.7666580Z 66/78 Test #66: simh-vax8200 .....................   Passed   46.40 sec
2023-12-18T20:44:49.7704860Z       Start 67: simh-vax8600
2023-12-18T20:45:19.7224060Z 67/78 Test #67: simh-vax8600 .....................   Passed   29.96 sec

scp-asynch-check, macos-12
2023-12-30T22:49:48.7536750Z       Start 14: simh-i650
2023-12-30T22:49:58.1462660Z 14/78 Test #14: simh-i650 ........................   Passed    9.39 sec
2023-12-30T22:51:16.2664470Z       Start 43: simh-pdp9
2023-12-30T22:51:28.4704520Z 43/78 Test #43: simh-pdp9 ........................   Passed   12.20 sec
2023-12-30T22:51:28.4707910Z       Start 44: simh-pdp8
2023-12-30T22:51:53.5352510Z 44/78 Test #44: simh-pdp8 ........................   Passed   25.06 sec
2023-12-30T22:52:16.8158880Z       Start 49: simh-sel32
2023-12-30T22:52:57.6537330Z 49/78 Test #49: simh-sel32 .......................   Passed   40.84 sec
2023-12-30T22:55:06.4621770Z       Start 63: simh-vax730
2023-12-30T22:55:32.9600190Z 63/78 Test #63: simh-vax730 ......................   Passed   26.50 sec
2023-12-30T22:55:32.9603800Z       Start 64: simh-vax750
2023-12-30T22:55:59.9299370Z 64/78 Test #64: simh-vax750 ......................   Passed   26.97 sec
2023-12-30T22:55:59.9302050Z       Start 65: simh-vax780
2023-12-30T22:56:27.6543340Z 65/78 Test #65: simh-vax780 ......................   Passed   27.72 sec
2023-12-30T22:56:27.6548040Z       Start 66: simh-vax8200
2023-12-30T22:57:09.7845010Z 66/78 Test #66: simh-vax8200 .....................   Passed   42.13 sec
2023-12-30T22:57:09.7846990Z       Start 67: simh-vax8600
2023-12-30T22:57:39.9672630Z 67/78 Test #67: simh-vax8600 .....................   Passed   30.18 sec

master, macos-11
2023-12-18T21:03:04.7011550Z       Start 14: simh-i650
2023-12-18T21:03:13.2134400Z 14/78 Test #14: simh-i650 ........................   Passed    8.51 sec
2023-12-18T21:06:02.0609650Z       Start 43: simh-pdp9
2023-12-18T21:06:09.3394660Z 43/78 Test #43: simh-pdp9 ........................   Passed    7.28 sec
2023-12-18T21:06:09.3398500Z       Start 44: simh-pdp8
2023-12-18T21:06:28.6204620Z 44/78 Test #44: simh-pdp8 ........................   Passed   19.28 sec
2023-12-18T21:06:51.8280340Z       Start 49: simh-sel32
2023-12-18T21:07:32.0508900Z 49/78 Test #49: simh-sel32 .......................   Passed   40.22 sec
2023-12-18T21:08:52.8460150Z       Start 63: simh-vax730
2023-12-18T21:09:17.0501860Z 63/78 Test #63: simh-vax730 ......................   Passed   24.20 sec
2023-12-18T21:09:17.0505620Z       Start 64: simh-vax750
2023-12-18T21:09:40.0051850Z 64/78 Test #64: simh-vax750 ......................   Passed   22.95 sec
2023-12-18T21:09:40.0055770Z       Start 65: simh-vax780
2023-12-18T21:10:04.9481220Z 65/78 Test #65: simh-vax780 ......................   Passed   24.94 sec
2023-12-18T21:10:04.9486120Z       Start 66: simh-vax8200
2023-12-18T21:10:45.7432590Z 66/78 Test #66: simh-vax8200 .....................   Passed   40.79 sec
2023-12-18T21:10:45.7449970Z       Start 67: simh-vax8600
2023-12-18T21:11:14.4834640Z 67/78 Test #67: simh-vax8600 .....................   Passed   28.74 sec

scp-async-check, macos-11
2023-12-30T23:18:46.6456990Z       Start 14: simh-i650
2023-12-30T23:18:55.2285460Z 14/78 Test #14: simh-i650 ........................   Passed    8.58 sec
2023-12-30T23:21:43.6067390Z       Start 43: simh-pdp9
2023-12-30T23:21:50.6427700Z 43/78 Test #43: simh-pdp9 ........................   Passed    7.04 sec
2023-12-30T23:21:50.6431320Z       Start 44: simh-pdp8
2023-12-30T23:22:11.4797930Z 44/78 Test #44: simh-pdp8 ........................   Passed   20.84 sec
2023-12-30T23:22:35.5608740Z       Start 49: simh-sel32
2023-12-30T23:23:14.6191720Z 49/78 Test #49: simh-sel32 .......................   Passed   39.06 sec
2023-12-30T23:24:36.6257880Z       Start 63: simh-vax730
2023-12-30T23:25:00.9677550Z 63/78 Test #63: simh-vax730 ......................   Passed   24.34 sec
2023-12-30T23:25:00.9681520Z       Start 64: simh-vax750
2023-12-30T23:25:23.9589520Z 64/78 Test #64: simh-vax750 ......................   Passed   22.99 sec
2023-12-30T23:25:23.9593040Z       Start 65: simh-vax780
2023-12-30T23:25:48.3434290Z 65/78 Test #65: simh-vax780 ......................   Passed   24.38 sec
2023-12-30T23:25:48.3438740Z       Start 66: simh-vax8200
2023-12-30T23:26:28.7560090Z 66/78 Test #66: simh-vax8200 .....................   Passed   40.41 sec
2023-12-30T23:26:28.7563620Z       Start 67: simh-vax8600
2023-12-30T23:26:56.7064660Z 67/78 Test #67: simh-vax8600 .....................   Passed   27.95 sec

master, windows-latest
2023-12-18T20:34:46.7583997Z       Start 14: simh-i650
2023-12-18T20:34:53.2716671Z 14/78 Test #14: simh-i650 ........................   Passed    6.51 sec
2023-12-18T20:35:21.9442907Z       Start 43: simh-pdp9
2023-12-18T20:35:23.4178588Z 43/78 Test #43: simh-pdp9 ........................   Passed    1.47 sec
2023-12-18T20:35:23.4181042Z       Start 44: simh-pdp8
2023-12-18T20:35:32.2809079Z 44/78 Test #44: simh-pdp8 ........................   Passed    8.86 sec
2023-12-18T20:35:35.8733987Z       Start 49: simh-sel32
2023-12-18T20:36:04.9026768Z 49/78 Test #49: simh-sel32 .......................   Passed   29.03 sec
2023-12-18T20:36:27.0287980Z       Start 63: simh-vax730
2023-12-18T20:36:39.3510162Z 63/78 Test #63: simh-vax730 ......................   Passed   12.32 sec
2023-12-18T20:36:39.3512952Z       Start 64: simh-vax750
2023-12-18T20:36:51.5914491Z 64/78 Test #64: simh-vax750 ......................   Passed   12.24 sec
2023-12-18T20:36:51.5917902Z       Start 65: simh-vax780
2023-12-18T20:37:04.7854803Z 65/78 Test #65: simh-vax780 ......................   Passed   13.19 sec
2023-12-18T20:37:04.7858054Z       Start 66: simh-vax8200
2023-12-18T20:37:26.9994851Z 66/78 Test #66: simh-vax8200 .....................   Passed   22.21 sec
2023-12-18T20:37:26.9997978Z       Start 67: simh-vax8600
2023-12-18T20:37:42.9728713Z 67/78 Test #67: simh-vax8600 .....................   Passed   15.97 sec

scp-async-check, windows-latest
2023-12-30T22:55:10.5304235Z       Start 14: simh-i650
2023-12-30T22:55:15.2424111Z 14/78 Test #14: simh-i650 ........................   Passed    4.71 sec
2023-12-30T22:55:45.1589062Z       Start 43: simh-pdp9
2023-12-30T22:55:46.6689113Z 43/78 Test #43: simh-pdp9 ........................   Passed    1.51 sec
2023-12-30T22:55:46.6691734Z       Start 44: simh-pdp8
2023-12-30T22:55:55.6667731Z 44/78 Test #44: simh-pdp8 ........................   Passed    9.00 sec
2023-12-30T22:55:59.7842585Z       Start 49: simh-sel32
2023-12-30T22:56:29.0803876Z 49/78 Test #49: simh-sel32 .......................   Passed   29.30 sec
2023-12-30T22:56:51.3840501Z       Start 63: simh-vax730
2023-12-30T22:57:03.7043321Z 63/78 Test #63: simh-vax730 ......................   Passed   12.32 sec
2023-12-30T22:57:03.7046522Z       Start 64: simh-vax750
2023-12-30T22:57:16.9841890Z 64/78 Test #64: simh-vax750 ......................   Passed   13.28 sec
2023-12-30T22:57:16.9844383Z       Start 65: simh-vax780
2023-12-30T22:57:30.2095990Z 65/78 Test #65: simh-vax780 ......................   Passed   13.22 sec
2023-12-30T22:57:30.2099214Z       Start 66: simh-vax8200
2023-12-30T22:57:52.6225670Z 66/78 Test #66: simh-vax8200 .....................   Passed   22.41 sec
2023-12-30T22:57:52.6228965Z       Start 67: simh-vax8600
2023-12-30T22:58:08.5845251Z 67/78 Test #67: simh-vax8600 .....................   Passed   15.96 sec

@markpizz
Contributor

markpizz commented Jan 4, 2024

With respect to defining SIM_ASYNCH_IO, the makefile globally defines that on the command line for ALL simulators in OS_CCDEFS when it determines thread support exists. SIM_ASYNCH_IO is not per-simulator.

It used to be per-simulator, and it certainly isn't necessary; it is not the case in the simh/simh repo OR in the Visual Studio projects whenever asynch I/O isn't appropriate. USE_READER_THREAD is only meaningful for simulators that actively use sim_ether. SIM_ASYNCH_IO is appropriate for simulators that use sim_ether or use RQ and TQ devices. All other simulators are completely single threaded.

As for your rigorous tests:

Most simulator testing merely validates the register definitions and does essentially nothing related to simulator execution or event processing. Additionally, the cases of more extended testing (VAX and others) are designed to rigorously exercise the instruction execution activities of the particular simulator and deliberately do nothing special to touch devices that may actually perform asynchronous I/O. None of these aggressively exercise anything related to event activities. I'll provide more precise evidence of the performance difference later today.

@markpizz
Contributor

markpizz commented Jan 6, 2024

For the makefile, my recommendation would be to carefully merge the current simh/simh makefile, thereby picking up Mark's changes (unless Mark permits cherry picking all of the commits not marked "DISK".)

The ONLY things which can be cherry picked from simh/simh are changes made by others. ABSOLUTELY ANY CHANGE I have made IS OFF LIMITS.

@rcornwell
Member

But changes made to open-simh are ok to be pulled into your repository?

@agn453
Contributor

agn453 commented Jan 6, 2024

Just go ahead and make the changes to the open-simh makefile - as appropriate to the issues reported here.
I'd expect to see a similarity to changes made elsewhere if they were fixing the same thing.

@markpizz
Contributor

markpizz commented Jan 6, 2024

I occasionally (today included) submit changes directly to open-simh/simh. Some of my PRs are merged, some are not. The more that are not merged, the less likely I am to submit future ones, and the less likely I am to participate in discussions about code I wrote.

Therefore the approximately 200 commits I've made since the open-simh fork are no longer part of open-simh. These include fixing bugs, adding functionality and generally extending some simulators and the platform in general.

@rcornwell:

But changes made to open-simh are ok to be pulled into your repository?

Different licenses, different rules. You guys motivated the different rules based on the idea that my solution to a user problem wasn't good enough and required a different solution. This varied from the historical approach always taken in the simh project for the prior 20 years, where the author of possibly problematic code is primarily responsible for fixes.

I submitted today's PR due to concern about the problems raised by this PR, which a later message will provide details about.

@markpizz
Contributor

markpizz commented Jan 6, 2024

I'm starting to wonder why SIM_ASYNCH_IO should be around in the first place.

Any simulated disk or tape device could be written to provide asynchronous operations. Writing with that in mind requires restructuring the traditional simulated device to take advantage of that approach.

When Sergey Oboguev implemented his multi-processor (32 processor) VAX simulator back in 2012 he started from the then current simh code which at that time included asynch RQ, TQ and ethernet devices. Pausing for I/O operations "in between instructions" would have significantly impacted the behaviors of the simulated processors.

Meanwhile, the only simulators which might need SIM_ASYNCH_IO are ones with Ethernet and then it only exists to avoid polling for input which then lets things be well behaved while idling.

The cmake build infrastructure should only build with SIM_ASYNCH_IO for simulators that need it.

Evidence:

Testing event queueing performance with Asynchronous I/O is not available
        simh queueing performance without asynchronous I/O activity
After 960,000,160 cycles, 100M events have been processed.
CPU Time: 7.477155 seconds (System: 0 seconds, User: 7.477155 seconds)

Same host using simh/simh repo with SIM_ASYNCH_IO defined and DONT_USE_AIO_INTRINSICS defined, thus avoiding any theoretical issues with InterlockedCompareExchangePointer

Testing event queueing performance with Lock based asynchronous event queue
        simh queueing performance without asynchronous I/O activity
After 960,000,160 cycles, 100M events have been processed.
CPU Time: 37.386039 seconds (System: 10.478 msecs, User: 37.375561 seconds)

Same host using simh/simh repo with SIM_ASYNCH_IO defined, not avoiding anything with InterlockedCompareExchangePointer.

Testing event queueing performance with Lock free asynchronous event queue
        simh queueing performance without asynchronous I/O activity
After 960,000,160 cycles, 100M events have been processed.
CPU Time: 37.057254 seconds (System: 614 usecs, User: 37.05664 seconds)

All 3 cases had the same events scheduled and completed. This test represents a basic simulator that has no Ethernet devices. In these cases, there seems to be a 5X advantage in the SCP event management APIs when compiling without SIM_ASYNCH_IO defined.

On Windows compiled with VS2008 (Debug build), we get:

Testing event queueing performance with Asynchronous I/O is not available
        simh queueing performance without asynchronous I/O activity
After 960,000,160 cycles, 100M events have been processed.
CPU Time: 12.03125 seconds (System: 0 seconds, User: 12.03125 seconds)
Testing event queueing performance with Lock free asynchronous event queue
        simh queueing performance without asynchronous I/O activity
After 960,000,160 cycles, 100M events have been processed.
CPU Time: 51.8125 seconds (System: 62.5 msecs, User: 51.75 seconds)
Testing event queueing performance with Lock based asynchronous event queue
        simh queueing performance without asynchronous I/O activity
After 960,000,160 cycles, 100M events have been processed.
CPU Time: 1:13.15625 minutes (System: 265.625 msecs, User: 1:12.890625 minutes)

This is a 4.3X advantage in the SCP event management APIs when compiling without SIM_ASYNCH_IO defined.

Apart from this basic processing advantage when there are no asynchronous threads in the picture, I've got another test which very aggressively exercises the asynchronous event system with a varying number of threads, processing 100 million asynchronous events. This exercise runs equivalently with BOTH the lock-based asynchronous system and the lock-free system. Performance is better in the lock-free case. "Equivalently" means that all thread activity completes as expected, but due to dynamic scheduling of threads by the OS, the order changes slightly - this is fully expected.

The exact same results are observed on: Windows Intel, WSL Intel, Ubuntu Intel, macOS Intel, macOS M2, Raspberry Pi.

BTW, the VAX diagnostic tests aggressively exercise the instruction execution, but actually only perform at most several dozen asynchronous events reading the diagnostics into memory. Not much of a test of the asynchronous event system.

Given that the above mentioned aggressive testing has not uncovered any actual problems to be solved, I suggest that before this PR is considered further, a demonstrable failure should be provided that the changes actually fix.

@pkoning2:
BTW, there was an early effort to provide asynch I/O for TMXR, but that never got tested and its value disappeared completely when I added precise port speed timing.

@rcornwell
Member

Your fix for AIO seems to be to remove it. @bscottm fix appears to solve the problems with it.

@pkoning2
Member

pkoning2 commented Jan 6, 2024

...
For the makefile, my recommendation would be to carefully merge the current simh/simh makefile, thereby picking up Mark's changes (unless Mark permits cherry picking all of the commits not marked "DISK".)

No; the rule is no bringing over stuff from simh/simh because of the mismatching license. What does work is for the author of some change (who can license code at will as the owner) to contribute stuff to open-simh whether it was also submitted to simh/simh or not.

@pkoning2
Member

pkoning2 commented Jan 6, 2024

I occasionally (today included) submit changes directly to open-simh/simh. Some of my PRs are merged, some are not.

As a rule they get merged, but there may be delays due to busy work schedules.

@bscottm
Contributor Author

bscottm commented Jan 6, 2024

@markpizz: The queue testing code isn't in open-simh/simh, so there's no way to reproduce the results.

@pkoning2: I'll make it a simulator option in both the makefile and CMake.

@markpizz
Contributor

markpizz commented Jan 6, 2024

@markpizz: The queue testing code isn't in open-simh/simh, so there's no way to reproduce the results.

You opened this PR specifically with the assertion that there was a problem that needed fixing in simh asynchronous event queueing, without any proof of that assertion. You should be able to write your own test case which demonstrates the problem you are actually fixing. The concept of how to write a demonstration which exercises the original code and produces a failure should be simple for you, given the language you use in the PR code comments, which claims various specific failures and deep knowledge of how things work. Using my code, which proves there are no problems, doesn't satisfy your requirement to demonstrate an actual failure case.

The theoretical example cases that have been suggested aren't actually reproducible after many dozens of tries on various platforms. The assertion that these problems don't happen when run with the PR code doesn't really prove anything, since they aren't reproducible beforehand. Even if these rare problems actually occur, the timing of the code changes in the different implementations could readily change the timing between threads, and that changed timing could affect the true cause of the problem, which I am now sure isn't related to asynchronous simh queueing.

The cases that were "rumored" to be fixed by the PR:
B5500, PDP10-KL, IBM1130, SEL32 and VAX haven't been tracked down to thread event queueing. The B5500 and IBM1130 don't have any muti-thread code, and therefore should produce the same failures with with or without asynchronous I./O being compiled in. The IBM1130 had intermittent hanging problems with windows CI/CD builds long ago when Lars had Windows CI/CD on Appveyor. The Windows build of the IBM1130 never ran with SIM_ASYNC_IO enabled. If a detailed failing case is provided for the B5500, I'll be glad to track it down, but I'm 100% sure it has nothing to do with SIM_ASYNC_IO.

The only bug I found was the fact that SET NOASYNC didn't disable asynchronous event queueing for devices using sim_ether. I've provided a fix for that. This doesn't fix anything actually related to the code in this PR, but it was discovered while exploring the theoretical problem space (as was the missing SIM_ASYNCH_IO in the open-simh/simh makefile).

@bscottm
Contributor Author

bscottm commented Jan 7, 2024

@markpizz: The queue testing code isn't in open-simh/simh, so there's no way to reproduce the results.

You opened this PR specifically with the assertion that there was a problem that needed fixing in simh asynchronous event queueing without any proof of that assertion.

I once asked a relatively simple question, "Is sim_asynch_check = 0; at scp.c:439 necessary?" Since I never got a clear answer to that, I assumed that forcing the AIO check on the next call of AIO_EVENT_CHECK was necessary. While staring at the current implementation, I came across issues that I've encountered before, and taught undergraduates and cautioned defense contractors to avoid.

If one is going to use threads, one has to take into consideration read/write consistency across cores. Using mutexes does not automatically ensure consistency; it only ensures sequential ordering between threads. Consistency is an issue on ARM, which underlies the Apple M1 processors. So, one has to pay attention to this problem if forcing the AIO check is necessary.

Windows' InterlockedCompareExchangePointer is understood to be expensive. I can't speak for VS 2008 because I haven't looked at its generated assembly code. I can speak for VS 2019, VS 2022: InterlockedCompareExchangePointer lowers to one instruction -- LOCK CMPXCHG. That's all it is. GCC and CLang do the same thing for __atomic_compare_exchange == LOCK CMPXCHG. Also notable is that AIO_QUEUE_VAL() != QUEUE_LIST_END lowers to a single CMPL (compare longword) on Intel GCC and CLang, which is why those rough performance numbers come out a little faster.

Windows does not have an InterlockedLoadPointer counterpart to GCC and CLang's __atomic_load. The workaround is to use InterlockedCompareExchangePointer with a compare that always fails; since InterlockedCompareExchangePointer always returns the pointer's original value, it's about as inexpensive as one can get to simulate an atomic load on Windows.

sim_instr is the critical performance path for simulators. That's why I went with a rough estimate for the synchronized (interlocked, atomic) access to sim_asynch_queue in AIO_CHECK_EVENT when running the vax730 diagnostics on Linux/Unixen:

for i in $(seq 1 50); do time BIN/vax730 VAX/tests/vax-diag_test.ini > /dev/null; done

This measures an amortized sim_asynch_queue penalty while executing a wide cross section of instructions. Actual I/O is relatively subsidiary to measuring the amortized penalty.

I could write a trivial test that exercises the NOP instruction path in a simulator's sim_instr function on Windows using a pessimistic 16-year-old compiler. That merely measures the impact of calling InterlockedCompareExchangePointer without amortizing across any instruction workload, which is fairly pointless.

@bscottm
Copy link
Contributor Author

bscottm commented Jan 13, 2024

@pkoning2 PR #328 updated to include applying AIO to the simulators that actually require it (i.e., simulators that call/use AIO_CHECK_EVENT). Made that change independent of this proposal.

@pkoning2
Copy link
Member

pkoning2 commented Feb 1, 2024

Could you handle the merge conflicts?

@markpizz
Copy link
Contributor

markpizz commented Feb 1, 2024

Please hold off on committing this PR.

The author hasn't actually provided anything which demonstrates that this PR fixes any problem.

He cites this question, which has been repeatedly answered in the past, as the motivation for the PR, due to some external threading analysis tool generating a false positive:

I once asked a relatively simple question, "Is sim_asynch_check = 0; at scp.c:439 necessary?" Since I never got a clear answer to that, I assumed that forcing the AIO check on the next call of AIO_EVENT_CHECK was necessary.

The answer, once again, is that the setting of that variable (and its reference elsewhere) is deliberately not synchronized. Synchronizing it might deliver an event into the simulator a little sooner than it might otherwise happen, but the cost of synchronizing that check on every potential reference would significantly slow down simulator execution with no measurable gain in the behavior of the simulator.

Further explanation and details provided later today.

@markpizz
Copy link
Contributor

markpizz commented Feb 1, 2024

While staring at the current implementation, I came across issues that I've encountered before, and taught undergraduates and cautioned defense contractors to avoid.

It is interesting that, while teaching your undergraduates and interacting with defense contractors, you never suggested that they should have unit tests to exercise and prove problems exist. Maybe you suggested it for them, but you're absolved of any such utilities and can just blow ahead and rewrite things which aren't actually broken.

If one is going to use threads, one has to take into consideration read/write consistency across cores. Using mutexes does not automatically ensure consistency; it only ensures sequential ordering between threads. Consistency is an issue on ARM, which underlies the Apple M1 processors. So, one has to pay attention to this problem if forcing the AIO check is necessary.

That is very wise, and in fact precisely what was done.

Windows' InterlockedCompareExchangePointer is understood to be expensive. I can't speak for VS 2008 because I haven't looked at its generated assembly code. I can speak for VS 2019, VS 2022: InterlockedCompareExchangePointer lowers to one instruction -- LOCK CMPXCHG. That's all it is. GCC and CLang do the same thing for __atomic_compare_exchange == LOCK CMPXCHG. Also notable is that AIO_QUEUE_VAL() != QUEUE_LIST_END lowers to a single CMPL (compare longword) on Intel GCC and CLang, which is why those rough performance numbers come out a little faster.

When building with VS2022, the event APIs are actually faster than the VS2008 compiler, which is great AND it proves that the newer compiler improves things without breaking code that has always worked.

Windows does not have an InterlockedLoadPointer counterpart to GCC and CLang's __atomic_load. The workaround is to use InterlockedCompareExchangePointer with a compare that always fails; since InterlockedCompareExchangePointer always returns the pointer's original value, it's about as inexpensive as one can get to simulate an atomic load on Windows.

sim_instr is the critical performance path for simulators. That's why I went with a rough estimate for the synchronized (interlocked, atomic) access to sim_asynch_queue in AIO_CHECK_EVENT when running the vax730 diagnostics on Linux/Unixen:

for i in $(seq 1 50); do time BIN/vax730 VAX/tests/vax-diag_test.ini > /dev/null; done

This measures an amortized sim_asynch_queue penalty while executing a wide cross section of instructions. Actual I/O is relatively subsidiary to measuring the amortized penalty.

Your exercise of the changed code uses the VAX730 simulator, which reasonably exercises the full range of the simulator's ability to execute all of its instructions with some vigor. However, the whole simulator test exercises some 143k lines of code across some 148 files. During this vigorous exercise, the simulator runs for almost 9 minutes of elapsed time, executes some 350 million instructions, and interacts with the SCP event system for only some 125K events, with a grand total of 53 asynchronous events. This relatively small number of events, and essentially no asynchronous events, would hardly be considered by anyone a test of the code you've changed in this PR. Repeating that test really doesn't add any value in exercising the changed code. The point of exercising the event system, and more specifically the asynchronous event code, should be to assure that the code you're changing is robust across all theoretical multi-threaded interactions. The total asynchronous event processing code is some 200ish lines. A reasonable unit test of these 200+ lines that 1) demonstrates the failure of the prior code, and 2) demonstrates the proper behavior of the changed code would be required by most folks in order to adopt this changed code.

I could write a trivial test that exercises the NOP instruction path in a simulator's sim_instr function on Windows using a pessimistic 16-year-old compiler. That merely measures the impact of calling InterlockedCompareExchangePointer without amortizing across any instruction workload, which is fairly pointless.

The example measurements I provided earlier proved that simulators which never leverage the SCP asynchronous APIs were burdened with significant extra cost during event processing, and therefore should be built without USE_ASYNCH_IO defined.

The real test and point of your changes is the behavior of multiple threads when the SCP asynchronous event APIs are rigorously engaged, including all the potential ways and timings in which separate threads can interact. Such a test should demonstrate failure when run with the unmodified asynchronous event APIs and success with the modified code. These results should demonstrate success on all platforms that SIMH supports.

Getting back to the earlier question about whether "sim_asynch_check = 0" is needed: as I've said earlier, "it would be nice" if it was noticed every time it is done, but as demonstrated by the full vax730 test, asynchronous events happen very rarely relative to other event interactions and instruction execution, so adding overhead to assure that an asynchronous event is always delivered as soon as possible, versus slipping by a few instruction times, is not worth it.

Some of your comments in the earlier commits suggested things about acquire and release semantics. These might be relevant on some platforms, but the total changes required to achieve this comprise 2 lines removed and 4 lines added to the existing code, which results in code that reliably works on all current platforms AS WELL as IA64 VMS, which is then a distinct improvement for that rare platform.

At the beginning of this PR, you asserted in the commit comments and in code comments that the existing code causes synchronization problems on various simulators, without anything that actually proved the code being changed here was responsible for the observed problems. Now, after my earlier comments, you've removed those assertions, so what remains is merely "gee, if I had to solve this problem, I'd do it this way". This is another of your "gee if" code changes that don't actually fix anything that is broken. The Open SIMH Steering Group is welcome to take any and all such changes, rather than merely focusing on adopting changes which actually fix problems or enhance functionality.

@bscottm
Copy link
Contributor Author

bscottm commented Feb 8, 2024

Fixing a bug is not a predicate to simplifying complex code. The LLVM thread sanitizer brought an issue to light that led to an opportunity to simplify the AIO thread code. LLVM isn't some random tool: it's the compiler toolchain for macOS and widely used. Were it the case that the sanitizers produced consistent false positives, both the Google and the LLVM teams would have addressed them.

Modifying sim_asynch_check in two separate threads without atomic access or within the bounds of a mutex is a defect. That the lock-free code path works is not by explicit intent, i.e., explicit atomic access to ensure cross-core memory consistency. This patchset restructures the code so that the intent is very explicit and no assumptions need to be made with respect to the underlying processor's memory consistency semantics.

Asymmetry: There is one #ifdef forest that goes down the path for lock-free and another for mutex-based exclusion regions. In the lock-free case, the sim_asynch_lock mutex is not recursive. In the mutex-based exclusion region case, it is recursive. Asymmetry makes code harder to read and maintain, unless there's clear documentation as to why that asymmetry exists. This patchset initializes sim_asynch_lock as a recursive mutex for both the lock-free and mutex-based code paths.

Lock-free: The original code surrounds the lock-free access to sim_asynch_queue with a sim_asynch_lock mutex acquire and release via AIO_I_LOCK and AIO_I_UNLOCK. Either use lock-free sim_asynch_queue access or use mutex-based exclusion; do not mix the two. If AIO_I_LOCK needs to acquire sim_asynch_lock, then there's an argument to be made that there is an underlying issue with lock-free access to sim_asynch_queue. This patchset removes the extra mutex acquire-release in the lock-free case.

Ensuring original code intent: The intent behind setting sim_asynch_check = 0 in the AIO thread is to signal the main SIMH thread to queue available AIO transactions on the UNITs for further processing. This intent is easily met just by checking whether sim_asynch_queue isn't empty since sim_asynch_queue is already guarded by atomic accesses or mutex exclusion regions. sim_asynch_check = 0 becomes unnecessary.

Inlines vs. #define code: Inline code ensures type safety and debuggers will step through the code to show the explicit execution flow. This patchset restructures the lock-free atomic accesses as inline code both for explicit execution flow and to isolate the #ifdef forests to a single place. Given that SIMH requires a C99-compliant compiler, this isn't a breaking change.

Mutex semantic overloading: sim_asynch_lock is used for two distinct purposes -- an exclusion region for AIO and to serialize debugging output in _sim_debug_write_flush. This patchset adds a separate mutex to serialize debugging output so that sim_asynch_lock is solely used for AIO and no other purpose.

Possible Heisenbug: @rcornwell observed that the pdp10-kl simulator stalls after a period of time and pointed to the main SIMH thread waiting for a mutex. I've seen this behavior and it is hard to reproduce, which fits the description of a Heisenbug. I chose to simplify the code in an effort to track down Richard's observations. Apparently, this patchset resolves the stall behavior.

@bscottm
Copy link
Contributor Author

bscottm commented Feb 8, 2024

Looking into why CMake 3.28.9 doesn't like "Visual Studio 17 2022". Works fine locally for me for CMake 3.28.8 but borks for 3.28.9. Ubuntu and macOS builds use an older version as well, so this appears to be an issue with the latest, 3.28.9.

@markpizz
Copy link
Contributor

@bscottm said:

Fixing a bug is not a predicate to simplifying complex code. The LLVM thread sanitizer brought an issue to light that led to an opportunity to simplify the AIO thread code. LLVM isn't some random tool: it's the compiler toolchain for macOS and widely used. Were it the case that the sanitizers produced consistent false positives, both the Google and the LLVM teams would have addressed them.

Your code is no less complex than the original code, in fact it merely moves things around mostly to create your needed inline setup. Of course, digging through things and moving things around got you familiar with it so now it doesn't seem complex.

Modifying sim_asynch_check in two separate threads without atomic access or within the bounds of a mutex is a defect. That the lock-free code path works is not by explicit intent, i.e., explicit atomic access to ensure cross-core memory consistency. This patchset restructures the code so that the intent is very explicit and no assumptions need to be made with respect to the underlying processor's memory consistency semantics.

You've removed the "other thread" modifying sim_asynch_check so with the code in this PR, the original role of setting sim_asynch_check to 0 isn't happening at all, nor does it matter since all references to sim_asynch_check happen in the main simulator thread. This is NOT what was intended.

Meanwhile, your amazing thread sanitizer tool somehow doesn't identify at least one other variable which is referenced without explicit synchronization in multiple threads.

Asymmetry: There is one #ifdef forest that goes down the path for lock-free and another for mutex-based exclusion regions. In the lock-free case, the sim_asynch_lock mutex is not recursive. In the mutex-based exclusion region case, it is recursive. Asymmetry makes code harder to read and maintain, unless there's clear documentation as to why that asymmetry exists. This patchset initializes sim_asynch_lock as a recursive mutex for both the lock-free and mutex-based code paths.

It might be harder to maintain if there was actually a need to maintain it since it had actual failure modes. You haven't demonstrated anything specific which actually fails. Certainly, if a problem actually existed it would need to be addressed by someone who correctly considered all of the original intended design goals.

Your solution to your PR #327 shows that you actually can create a unit test to exercise some internal code specifically, but you silently ignore that any such unit testing would be useful and provide any number of insights.

Your single test case which is absolutely not reproducible is no proof that there is a problem in the asynch event code. A rumor has been stated that a hang was sitting in a mutex wait when examined with a debugger. You could readily take such a case and dig around the rest of the memory state to get a handle on the actual conditions which caused the hang. Knowing that, you could then write a unit test which creates those conditions reliably, so it wouldn't be hard to reproduce. Once you had such a test case, you could drop in different code for the asynch event queuing system and fix the problem. I believe that you won't actually find such a problem.

Lock-free: The original code surrounds the lock-free access to sim_asynch_queue with a sim_asynch_lock mutex acquire and release via AIO_I_LOCK and AIO_I_UNLOCK. Either use lock-free sim_asynch_queue access or use mutex-based exclusion; do not mix the two. If AIO_I_LOCK needs to acquire sim_asynch_lock, then there's an argument to be made that there is an underlying issue with lock-free access to sim_asynch_queue. This patchset removes the extra mutex acquire-release in the lock-free case.

I can't find any AIO_I_LOCK or AIO_I_UNLOCK. I do use AIO_ILOCK and AIO_IUNLOCK, which do something when a lock free solution isn't available (the mutex case), and otherwise are defined as nothing. This strategy (of #defines sometimes being blank) is precisely how all of the SIM_ASYNCH_IO enabled macros are only defined when needed. You don't seem to want to change that approach.

Ensuring original code intent: The intent behind setting sim_asynch_check = 0 in the AIO thread is to signal the main SIMH thread to queue available AIO transactions on the UNITs for further processing. This intent is easily met just by checking whether sim_asynch_queue isn't empty since sim_asynch_queue is already guarded by atomic accesses or mutex exclusion regions. sim_asynch_check = 0 becomes unnecessary.

That is certainly true and addresses the point I raised above. Extending that logic (testing only the sim_asynch_queue) one inch further removes the concept of having sim_asynch_check at all. MEANWHILE, as I've explained previously, referencing sim_asynch_queue (which sometimes requires a LOCK, or at least various memory flushes on some platforms) is more expensive than the unsynchronized value reference to sim_asynch_check. That needs to happen on every instruction execution in sim_instr(), while actual asynchronous events are very very rare in a running simulator and actually occur maybe once every 100's of thousands of instructions (or maybe millions of instructions). Avoiding the continuous overhead of mutex access or memory flushing makes sense to me, but not to you or your thread sanitizer.

Inlines vs. #define code: Inline code ensures type safety and debuggers will step through the code to show the explicit execution flow. This patchset restructures the lock-free atomic accesses as inline code both for explicit execution flow and to isolate the #ifdef forests to a single place. Given that SIMH requires a C99-compliant compiler, this isn't a breaking change.

#defines (if properly defined) also ensure precise type safety, but step through debugging is generally not available.

The "#ifdef forest" was already in a single place. Now, with your "Inline code", you need a compiler situation which in fact actually inlines it. Otherwise, you are burdened with procedure call overhead for every reference; with the "#ifdef forest" you get automatic inlining for free in both debug and release builds.

Mutex semantic overloading: sim_asynch_lock is used for two distinct purposes -- an exclusion region for AIO and to serialize debugging output in _sim_debug_write_flush. This patchset adds a separate mutex to serialize debugging output so that sim_asynch_lock is solely used for AIO and no other purpose.

We're back to "gee if I were to write this". Please demonstrate any functional problem with the existing code as well as the debug case.

Possible Heisenbug: @rcornwell observed that the pdp10-kl simulator stalls after a period of time and pointed to the main SIMH thread waiting for a mutex. I've seen this behavior and it is hard to reproduce, which fits the description of a Heisenbug. I chose to simplify the code in an effort to track down Richard's observations. Apparently, this patchset resolves the stall behavior.

It merely moves the bar to a different Heisenbug place which you haven't stumbled into yet. As I advised above, capture the hanging case and dig through the rest of the state to find the necessary conditions for the hang, then create a unit test which reliably recreates the hang conditions. FYI, several years back I dug through prior apparent hangs in at least one of @rcornwell's simulators and the problem was not in any asynchronous code, but an oversight in some aspect of the simulator/device code. The simulator code is large and complicated as is the rest of the SCP code which supports it. A unit test which specifically exercises the presumed failing routines (the asynch event queue management) will naturally remove all of this extra simulator and SCP code that somehow stumbled into the failing case.

@bscottm
Copy link
Contributor Author

bscottm commented Feb 14, 2024

@mark: I don't have the time tonight to respond fully. Please refrain from referring to the Google-produced, LLVM compiler team-maintained thread sanitizer as "my thread sanitizer."

  1. The memory, undefined behavior, address and thread sanitizers were originally written by Google to debug Chrome and their web services. It is currently maintained as an integral part of the LLVM compiler suite.

  2. These tools have been around for a while and are demonstrably useful. Demean and disparage them all you want; they're part of CI/CD pipelines that sanitize software people use every day.

And always remember the Roman admonition, "Memento mori."

@bscottm
Copy link
Contributor Author

bscottm commented Feb 25, 2024

@pkoning2: This patch set is a refactoring and cleanup of the AIO "lock-free" implementation that fixes a few issues that I've documented and for which I've provided rationale in earlier parts of the thread. It may have also unintentionally fixed a latent Heisenbug. The refactoring uses static inline functions instead of preprocessor macro code. In my experience, static inlines provide better argument hygiene and type safety, vice preprocessor macro code.

Fundamentally, I've striven to provide correct cross-thread communication based on my experience of writing, teaching and reviewing multicore threaded code. It's a matter of doing it intentionally and not mixing paradigms. Lock-free code, if implemented correctly, should not have to lock a mutex.

To @markpizz's point of waiting for synchronization primitives, his concern is overblown. On Intel, AIO_EVENT_CHECK's atomic load is just a regular MOV instruction. There's no penalty in the main thread's sim_instr loop on any of the major platforms, including Windows. A cut down version of the proposed code:

#include <stdio.h>


void *do_atomic_load(void * volatile *src)
{
  void *retval;
  __atomic_load(src, &retval, __ATOMIC_SEQ_CST);
  return retval;
}


int do_cmpxchg(void * volatile *dest, void *src, void *current)
{
    return __atomic_compare_exchange(dest, &current, &src, 0, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}

What does the assembly look like? GCC (left) and Clang (right) produce:

        .file   "zzz.c"                                         |               .text
        .text                                                   |               .file   "zzz.c"
        .p2align 4                                              |               .globl  do_atomic_load                  # -- Begin function do_atomic_load
        .globl  do_atomic_load                                  |               .p2align        4, 0x90
        .type   do_atomic_load, @function                       |               .type   do_atomic_load,@function
do_atomic_load:                                                 |       do_atomic_load:                         # @do_atomic_load
.LFB23:                                                         |               .cfi_startproc
        .cfi_startproc                                          |       # %bb.0:
        endbr64                                                 |               movq    (%rdi), %rax
        movq    (%rdi), %rax                                    |               retq
        ret                                                     |       .Lfunc_end0:
        .cfi_endproc                                            |               .size   do_atomic_load, .Lfunc_end0-do_atomic_load
.LFE23:                                                         |               .cfi_endproc
        .size   do_atomic_load, .-do_atomic_load                |                                               # -- End function
        .p2align 4                                              |               .globl  do_cmpxchg                      # -- Begin function do_cmpxchg
        .globl  do_cmpxchg                                      |               .p2align        4, 0x90
        .type   do_cmpxchg, @function                           |               .type   do_cmpxchg,@function
do_cmpxchg:                                                     |       do_cmpxchg:                             # @do_cmpxchg
.LFB24:                                                         |               .cfi_startproc
        .cfi_startproc                                          |       # %bb.0:
        endbr64                                                 |               movq    %rdx, %rax
        movq    %rdx, %rax                                      |               xorl    %ecx, %ecx
        lock cmpxchgq   %rsi, (%rdi)                            |               lock            cmpxchgq        %rsi, (%rdi)
        sete    %al                                             |               sete    %cl
        movzbl  %al, %eax                                       |               movl    %ecx, %eax
        ret                                                     |               retq
        .cfi_endproc                                            |       .Lfunc_end1:
.LFE24:                                                         |               .size   do_cmpxchg, .Lfunc_end1-do_cmpxchg
        .size   do_cmpxchg, .-do_cmpxchg                        |               .cfi_endproc
        .ident  "GCC: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0"    |                                               # -- End function
        .section        .note.GNU-stack,"",@progbits            |               .ident  "Ubuntu clang version 14.0.0-1ubuntu1.1"
        .section        .note.gnu.property,"a"                  |               .section        ".note.GNU-stack","",@progbits
        .align 8                                                |               .addrsig
        .long   1f - 0f                                         |
        .long   4f - 1f                                         |
        .long   5                                               |
0:                                                              |
        .string "GNU"                                           |
1:                                                              |
        .align 8                                                |
        .long   0xc0000002                                      |
        .long   3f - 2f                                         |
2:                                                              |
        .long   0x3                                             |
3:                                                              |
        .align 8                                                |
4:                                                              |

See the movq under do_atomic_load? That's the __atomic_load. Nothing special on Intel. I've also special-cased that for Intel Windows with a specific comment. @markpizz's point has validity on non-Intel MS platforms and since I don't have access to non-Intel MS platforms, I've made pessimistic assumptions that can be relaxed.

For ARMv8 (AARCH64 and a proxy for Apple M1), GCC and Clang emit ldar instructions, which incur a small overhead penalty because the load can't be reordered before a stlr store barrier. If a compare-exchange occurred before the ldar, executing a corresponding stlr store instruction, you'd successfully pick up the value stored by the stlr. That overhead is minimal.

ARMv8 assembly output for the same C code as above:

        .arch armv8-a
        .file   "zzz.c"
        .text
        .align  2
        .p2align 4,,11
        .global do_atomic_load
        .type   do_atomic_load, %function
do_atomic_load:
.LFB11:
        .cfi_startproc
        ldar    x0, [x0]
        ret
        .cfi_endproc
.LFE11:
        .size   do_atomic_load, .-do_atomic_load
        .global __aarch64_cas8_acq_rel
        .align  2
        .p2align 4,,11
        .global do_cmpxchg
        .type   do_cmpxchg, %function
do_cmpxchg:
.LFB12:
        .cfi_startproc
        stp     x29, x30, [sp, -32]!
        .cfi_def_cfa_offset 32
        .cfi_offset 29, -32
        .cfi_offset 30, -24
        mov     x29, sp
        str     x19, [sp, 16]
        .cfi_offset 19, -16
        mov     x19, x2
        mov     x2, x0
        mov     x0, x19
        bl      __aarch64_cas8_acq_rel
        cmp     x0, x19
        ldr     x19, [sp, 16]
        cset    w0, eq
        ldp     x29, x30, [sp], 32
        .cfi_restore 30
        .cfi_restore 29
        .cfi_restore 19
        .cfi_def_cfa_offset 0
        ret
        .cfi_endproc
.LFE12:
        .size   do_cmpxchg, .-do_cmpxchg
        .ident  "GCC: (Debian 12.2.0-14) 12.2.0"
        .section        .note.GNU-stack,"",@progbits

The AIO_ILOCK and AIO_IUNLOCK macros are empty in the lock-free case. They're there to ensure mutual exclusion for lock-based code. Unless explicitly enabled, there are few platforms that still use the lock-based mutex code. The top 3 (macOS, Linux, Windows) are all lock-free implementations. I don't suggest dropping the lock-based code.

The claim that I've rearranged the complexity is true, to a limited extent. My aim was to make the code more readable and that should make it more maintainable in the long run. @markpizz obviously disagrees. It should be fairly easy to determine the preprocessor conditional flow for the lock-free, lock-based and no-AIO-at-all paths.

Heisenbugs: Hoping that one can find a way to consistently reproduce a Heisenbug is a fool's errand. That's why they're called "Heisenbugs". Perturbing the code is one of the few ways to get a Heisenbug to identify itself. It might be true that I've pushed the Heisenbug elsewhere or deeper. It might also be true that I've quashed a long term latent bug. Only time will tell.

The vigor with which @markpizz derides this code seems to be a perverse sort of admiration.

@bscottm
Copy link
Contributor Author

bscottm commented Feb 25, 2024

@pkoning2: Latest push eliminates the sim_asynch_check variable completely. All gone.

- Zlib is no longer packaged as part of macOS; it now needs to be
  installed via Homebrew.
- Periodic "python cmake/generate.py" to sync with makefile changes.

- Ensure that AIO_CCDEFS (makefile) and USES_AIO (CMake generate.py) are
  present for PDP-10 simulators. Otherwise, user gets a warning message
  when the simulator starts about network support and asynchronous I/O.
Address a Clang/LLVM sanitizer warning that scp_asynch_check is written
by both the AIO and main threads, one thread potentially clobbering the
other's value.

The "scp_asynch_check = 0" at scp.c:239 is where the thread sanitizer
detects the issue.  This assignment is supposed to cause the main thread
to process I/O updates on the next execution of AIO_CHECK_EVENT. To
preserve that behavior, AIO_CHECK_EVENT now executes AIO_UPDATE_QUEUE
when sim_asynch_queue != QUEUE_HEAD, i.e., when there is work to be done
in the queue.

Code refactoring:

- Convert preprocessor macro code to inline functions unless
  impractical.

- Eliminate the asymmetry between the lock-based (mutex) and lock-free
  implementations.

  - Lock-free: AIO_ILOCK/AIO_IUNLOCK do not reacquire sim_asynch_lock
    when compiler intrinsics are detected (GCC, Clang, Microsoft C, and
    DEC C on Itanium).

  - Lock-based: If DONT_USE_AIO_INTRINSICS is defined or intrinsics are
    not detected, the AIO implementation becomes lock-based via mutexes
    and AIO_ILOCK/AIO_IUNLOCK recursively acquire/release sim_asynch_lock.

  - AIO defaults to the lock-based implementation if compiler intrinsics
    are not detected.

  - GCC, Clang: Prefer the __atomic intrinsics over the deprecated
    __sync intrinsics. The __sync intrinsics still exist for older GCC
    compilers.

  - sim_asynch_lock is a recursive mutex for both lock-based and
    lock-free implementations. Eliminates implementation asymmetry.

- AIO_CHECK_EVENT invokes AIO_ILOCK and AIO_IUNLOCK so that the lock-based
  code cannot alter sim_asynch_queue when checking for pending I/O work.

- sim_debug_io_lock: Debug output serialization lock. sim_asynch_lock
  was semantically overloaded to serialize output from
  _sim_debug_write_flush. This lock provides better semantic clarity.

- Removed sim_asynch_check. It is no longer necessary.

- New builder script flag to disable AIO lock-free, force AIO lock-based code:

  - cmake-builder.ps1 -noaiointrinsics ...
  - cmake-builder.sh  -no-aio-intrinsics ...

  "-DDONT_USE_AIO_INTRINSICS:Bool=On" is the command line CMake
  configuration option.
@markpizz (Contributor) commented:
As I mentioned, I reacted to the message produced by your forced push to this PR which now does have 3 commits. I don't think it is appropriate to have the same commits present in multiple independent PRs.

@bscottm (Contributor Author) commented Jun 24, 2024

@markpizz: If/when the other PR gets merged, the extra commits will go away, since they will be part of master. That will mean a re-push to this branch, which is in no hurry to be merged anyway, so it's not as big a deal as it seems. (If you look at any of my other branches, you'll see the same thing until the other PR is merged.)
