rtapi_app: Deprecated autostart / add manual start / no autostop #4206

Open
hdiethelm wants to merge 5 commits into LinuxCNC:master from hdiethelm:rtapi_no_autostart

Conversation

@hdiethelm

@hdiethelm hdiethelm commented Jun 28, 2026

Contributor

Current behavior:

  • rtapi_app starts on first loadrt command
  • rtapi_app exits on last rt component unload

This doesn't really allow using hal.get_realtime_type() when no rt component is loaded: #4205

This has other side effects, one example:

halcmd: debug 0 #rtapi_app starts, sets debug and immediately exits

Note: realtime scheduling unavailable (sched_setscheduler SCHED_FIFO: Operation not permitted).
  Process capabilities: cap_sys_nice=no cap_ipc_lock=no.
  Falling back to POSIX non-realtime.
  Fix: 'sudo make setcap' (preferred) or 'sudo make setuid' on rtapi_app.
  Override (testing only): set LINUXCNC_FORCE_REALTIME=1.
Note: Using POSIX non-realtime

halcmd: loadrt sum2 #rtapi_app starts again, not remembering the previous debug value

Note: realtime scheduling unavailable (sched_setscheduler SCHED_FIFO: Operation not permitted).
  Process capabilities: cap_sys_nice=no cap_ipc_lock=no.
  Falling back to POSIX non-realtime.
  Fix: 'sudo make setcap' (preferred) or 'sudo make setuid' on rtapi_app.
  Override (testing only): set LINUXCNC_FORCE_REALTIME=1.
Note: Using POSIX non-realtime
halcmd: loadrt sum2 #rtapi_app starts and stays running because a realtime component is loaded

Note: realtime scheduling unavailable (sched_setscheduler SCHED_FIFO: Operation not permitted).
  Process capabilities: cap_sys_nice=no cap_ipc_lock=no.
  Falling back to POSIX non-realtime.
  Fix: 'sudo make setcap' (preferred) or 'sudo make setuid' on rtapi_app.
  Override (testing only): set LINUXCNC_FORCE_REALTIME=1.
Note: Using POSIX non-realtime

halcmd: debug 0

This PR changes the behavior to:

  • rtapi_app starts on realtime start
  • rtapi_app exits on realtime stop
  • rtapi_app still autostarts on load, unload, newinst and debug so that existing setups do not break, but it prints a warning that it should be started first.
halcmd loadrt sum2
WARNING: Deprecated: No master found. Use "realtime start" to start one.
  A master is started automatically.
  If this appears while using halcmd: Use halrun instead.
  halcmd should only be used with an already running realtime environment.
  halrun creates a realtime environment and tears it down at exit.
Note: Using XENOMAI4 EVL realtime

This makes the behavior nearly identical to RTAI, where realtime start / realtime stop is required, and makes the behavior of rtapi_app more deterministic.

Autostart without autostop should be mostly fine but the behavior will still be slightly different:
Master:

halcmd loadrt sum2
Note: Using XENOMAI4 EVL realtime
halcmd unload sum2
latency-test #Works
Note: Using XENOMAI4 EVL realtime

PR:

halcmd loadrt sum2
WARNING: Deprecated: No master found. Use "realtime start" to start one.
  A master is started automatically.
  If this appears while using halcmd: Use halrun instead.
  halcmd should only be used with an already running realtime environment.
  halrun creates a realtime environment and tears it down at exit.
Note: Using XENOMAI4 EVL realtime
halcmd unload sum2
latency-test 
halrun: Realtime already running.  Use 'halrun -U' to stop existing realtime session.
halrun -U
latency-test #Works
Note: Using XENOMAI4 EVL realtime

TBD

  • Will this create potential issues? Should be fine with autostart.
  • Check why I had to change the expected results for one test (the owner increased by two): this is due to the order change.
  • Remove the fork() inside halcmd: not needed after all. halcmd has to do fork/exec anyway. It waits for the component to be ready or for the app to exit, whichever happens first. This is fine because rtapi_app returns only when the load has finished, so the component becoming ready and the exit happen at more or less the same time.

@BsAtHome

Contributor

I'm not sure I like it that the raster test needs to call realtime start. It seems to be performed on the wrong level.

@hdiethelm

Contributor Author

I'm not sure I like it that the raster test needs to call realtime start. It seems to be performed on the wrong level.

Raster will most probably fail in RTAI the way it is implemented.
It uses: assert os.system('halcmd -f raster.hal') == 0, "raster.hal script failed" but it should probably use halrun.
Looks like a bad fix from my side, I have to look into it.

@hdiethelm

Contributor Author

Hmm, assert os.system('halrun raster.hal') == 0, "raster.hal script failed" of course doesn't just work. Halrun loads the script and immediately kills it.

Is there a way to keep it running until the test is finished? Looks like the raster test does some non-standard things, that's why it breaks when I use manual start.

@BsAtHome

Contributor

The real clue is that the test program builds a component that the raster.hal connects to and then starts the realtime from within. This is a legitimate construct as .hal files are just lines executed by halcmd. This construct is expected to work. Therefore, auto-start is a requirement, whereas auto-stop should not be.

(One very important thing is the line sets program 1000, which is not a value, but it initializes the HAL_PORT queue.)

@grandixximo

grandixximo commented Jun 28, 2026

Contributor

@BsAtHome makes the call for me: if a userspace component plus a .hal that brings RT up on demand is a legitimate construct, then autostart is a requirement and my "move it up to halrun / the harness" suggestion is wrong. Scratch that part.

That actually points at a smaller fix than this PR: keep autostart, drop only autostop. That preserves the raster construct with zero test edits, fixes the "debug value lost, master restarts on every loadrt" example from the PR description since the master now persists, and keeps realtime_type valid once anything RT is loaded. Mechanically it is mostly the master_process_socket_command change to !force_exit, while leaving the autostart path in main() alone. #4205 is unaffected either way: no-master-ever stays the honest UNINITIALIZED.
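To make the difference concrete, here is a toy Python model of the two lifecycles. It is only an illustration under stated assumptions: `ToyMaster`, `ToyRealtime` and the default debug level of 1 are hypothetical names and values, and none of this mirrors the actual C++ code in rtapi_app.

```python
class ToyMaster:
    """Toy stand-in for the rtapi_app master process (hypothetical)."""

    def __init__(self):
        self.debug_level = 1      # assumed default, for illustration only
        self.components = set()


class ToyRealtime:
    """Models autostart, with autostop switchable on or off."""

    def __init__(self, autostop):
        self.autostop = autostop
        self.master = None
        self.starts = 0           # how often a master had to be started

    def _ensure_master(self):
        # Autostart: create a master on demand if none is running.
        if self.master is None:
            self.master = ToyMaster()
            self.starts += 1
        return self.master

    def _maybe_stop(self):
        # Autostop: drop the master once no RT component is loaded.
        if self.autostop and self.master is not None and not self.master.components:
            self.master = None

    def debug(self, level):
        self._ensure_master().debug_level = level
        self._maybe_stop()

    def loadrt(self, name):
        self._ensure_master().components.add(name)

    def unload(self, name):
        if self.master is not None:
            self.master.components.discard(name)
        self._maybe_stop()


# Same command sequence as in the PR description: "debug 0", then "loadrt sum2".
with_autostop = ToyRealtime(autostop=True)
with_autostop.debug(0)
with_autostop.loadrt("sum2")

without_autostop = ToyRealtime(autostop=False)
without_autostop.debug(0)
without_autostop.loadrt("sum2")

print(with_autostop.starts, with_autostop.master.debug_level)
print(without_autostop.starts, without_autostop.master.debug_level)
```

In this model the autostop variant needs two master starts and loses the debug value, while the variant without autostop starts once and keeps it.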

The one review point that still stands regardless of which way you go: the hal-show +2 owner shift is worth root-causing before rebaselining. It is the raw rtapi module id, and the two extra ids look like the master (App()) and hal_lib now landing ahead of the user comps. Worth confirming it is deterministic, and ideally having hal-show print the owner by name so the test stops being coupled to RT startup allocation order.

@hdiethelm

Contributor Author

The real clue is that the test program builds a component that the raster.hal connects to and then starts the realtime from within. This is a legitimate construct as .hal files are just lines executed by halcmd. This construct is expected to work. Therefore, auto-start is a requirement, whereas auto-stop should not be.

(One very important thing is the line sets program 1000, which is not a value, but it initializes the HAL_PORT queue.)

It is not legal as long as RTAI exists. As predicted, this test fails with RTAI. RTAI has no autostart:

Running test: ../tests/raster
RTAPI: ERROR: could not open shared memory (No such file or directory)
HAL: ERROR: could not initialize RTAPI
Traceback (most recent call last):
error: Test failed: Traceback (most recent call last):
  File "/home/hannes/linuxcnc-src/tests/raster/./test", line 172, in main
    c = hal.component("test")
hal.error: Invalid argument

  File "/home/hannes/linuxcnc-src/tests/raster/./test", line 259, in <module>
    exit(main())
         ~~~~^^
  File "/home/hannes/linuxcnc-src/tests/raster/./test", line 253, in main
    c.exit()
    ^
UnboundLocalError: cannot access local variable 'c' where it is not associated with a value
*** ../tests/raster: XFAIL: test run exited with 1
Runtest: 1 tests run, 0 successful, 1 failed + 0 expected, 0 skipped, 0 shmem errors
Failed: 
../tests/raster

With this PR, the test still fails, but for a different reason:

Running test: ../tests/raster
ERROR:  Can't remove RTAI modules, kill the following process(es) first
                     USER        PID ACCESS COMMAND
/dev/rtai_shm:       hannes    ....m python3
 32343error: Test failed: Traceback (most recent call last):
  File "/home/hannes/linuxcnc-src/tests/raster/./test", line 202, in main
    testInvalidOffset(prog, pin)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/home/hannes/linuxcnc-src/tests/raster/./test", line 80, in testInvalidOffset
    assert pin['fault_code'].value == FaultCodes.InvalidOffset.value
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

*** ../tests/raster: XFAIL: test run exited with 1
Runtest: 1 tests run, 0 successful, 1 failed + 0 expected, 0 skipped, 0 shmem errors
Failed: 
../tests/raster

@BsAtHome

Contributor

Regarding auto-start... I'm not worried about the LCNC code base. I'm worried about external installations.

@hdiethelm

Contributor Author

Regarding auto-start... I'm not worried about the LCNC code base. I'm worried about external installations.

That's a point. Most people probably use uspace, not RTAI, so they never noticed that their setup will not work with RTAI because they use halcmd instead of halrun. Even the current tests are already broken... ;-)

I have to think about how to solve this issue in a way that does not break the whole concept behind it.

Main issue: if you have autostart, chances are high that you forget the stop. However, this will probably surface quickly: if you don't stop, halcmd will complain about already loaded components and you will use halrun -U to exit.

That leaves just the case where you unload all rt components and expect rtapi_app to exit, which won't happen. I already added a warning when you use start while rtapi_app is already running, so it should show up.

@grandixximo

Contributor

On the residual you flagged (unload all rt components and rtapi_app stays running): that is actually consistent, not a regression. On RTAI, unloading all components does not stop the RT base either, you realtime stop explicitly. So removing autostop while keeping autostart makes uspace match the RTAI model rather than diverge from it.

And the leak risk is bounded: halrun (halrun -U / realtime stop) and the linuxcnc script already stop on exit, so a lingering master only affects ad-hoc halcmd sessions, where the user already ends up using halrun -U as you noted. So I think keep autostart, drop autostop is the right narrow scope here.

@hdiethelm

hdiethelm commented Jun 28, 2026

Contributor Author

On the residual you flagged (unload all rt components and rtapi_app stays running): that is actually consistent, not a regression. On RTAI, unloading all components does not stop the RT base either, you realtime stop explicitly. So removing autostop while keeping autostart makes uspace match the RTAI model rather than diverge from it.

RTAI does not do auto-start either. Because of this, each linuxcnc application, including halrun, runs realtime start at the beginning. However, one test was messed up: it does not run on RTAI, and with this PR it no longer ran on uspace either until I added realtime start to it.

Given that most people use uspace and some of them probably use halcmd without realtime start, being inconsistent here and maybe printing a deprecation warning can still be a good thing, so that running setups are not bricked.

And the leak risk is bounded: halrun (halrun -U / realtime stop) and the linuxcnc script already stop on exit, so a lingering master only affects ad-hoc halcmd sessions, where the user already ends up using halrun -U as you noted. So I think keep auto-start, drop auto-stop is the right narrow scope here.

Agreed.

However, I don't like re-adding all this goto code and passing arguments to the master that are executed immediately, just for auto-start. So I might fully split master / client already in this PR and add auto-start-master functionality to the client. Let's see how this goes. Or I just create a start_master() function if splitting gets too cumbersome.

@hdiethelm hdiethelm force-pushed the rtapi_no_autostart branch 2 times, most recently from b8b76f2 to 5fa7f0a Compare June 28, 2026 14:18
Comment thread tests/raster/test Outdated
#use interactive mode to be have the hal running
#while needed and exit with writing "exit" at the end
halrun = subprocess.Popen(["halrun", "-Is", "raster.hal"], stdin=subprocess.PIPE)
time.sleep(0.5) #Needs a short delay until halrun is up and running

Contributor Author

This is probably not nice. ToDo: Find another way to check if halrun is up.

@hdiethelm hdiethelm force-pushed the rtapi_no_autostart branch 2 times, most recently from 36269fe to 08465cd Compare June 28, 2026 17:24
@hdiethelm

Contributor Author

So, the auto-start is back, but in a different way than before, one that supports an easy split into master / client later.

halcmd loadrt sum2
WARNING: Deprecated: No master found. Use realtime start to start one.
  A master is started automaticaly.
Note: Using XENOMAI4 EVL realtime

@grandixximo Do you know how I can wait until halrun = subprocess.Popen has halrun up and running (#4206 (comment))? I would have to wait until stdout shows %% on the command line to know it is ready, but halrun.stdout.readline() just blocks forever, even if I add stdout=subprocess.PIPE, and I don't see a way to add a timeout.

@hdiethelm

Contributor Author

Version with master/client as a separate exe:
https://github.com/hdiethelm/linuxcnc-fork/tree/rtapi_no_autostart_master_client

It works, but it looks like more effort to get it finalized, and it could create more issues in the future.

There is still some duplicated code which has to be moved to uspace_rtapi_common.cc, and a library is also missing; it is just a POC.

So I would prefer to go forward with the single-exe approach and split it in a future PR.

@grandixximo

Contributor

I think the %% may never come over a pipe: the prompt looks gated on isatty(stdin) (halcmd_main.c:238), so with Popen stdin not a tty, halcmd prints no prompt and readline() would block forever. That might be why it hangs.
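The tty gating can be reproduced without halcmd. The sketch below is a Python stand-in, not halcmd's code: a child process decides whether it would print a prompt based on `sys.stdin.isatty()`, and the parent feeds it through a pipe just as `Popen(..., stdin=subprocess.PIPE)` does. The child script and the "prompt" / "no prompt" strings are made up for illustration.

```python
import subprocess
import sys

# Child: would only "print a prompt" when stdin is a terminal.
CHILD = "import sys; print('prompt' if sys.stdin.isatty() else 'no prompt')"

# Passing input= makes subprocess.run connect the child's stdin to a pipe.
result = subprocess.run(
    [sys.executable, "-c", CHILD],
    input="",
    capture_output=True,
    text=True,
)
print(result.stdout.strip())
```

With stdin on a pipe the child takes the non-tty branch, which is consistent with `cat | halcmd -kfs` showing no `%%` while an interactive `halcmd -kfs` does.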

Maybe it is easier to sync through HAL than stdout here, since the test is already a HAL peer? Something like replacing the sleep(0.5) with a poll:

halrun = subprocess.Popen(["halrun", "-Is", "raster.hal"], stdin=subprocess.PIPE)
deadline = time.time() + 10
while time.time() < deadline:
    if hal.component_exists("raster") and hal.pin_has_writer("test.output"):
        break
    time.sleep(0.01)
else:
    raise RuntimeError("raster.hal did not come up within 10s")

pin_has_writer("test.output") should go true once raster.hal runs net output ... => test.output, and start is right after, so it would be deterministic and timeout-bounded without needing a pty. What do you think, would that work for your case?
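The poll above can be factored into a small reusable helper. This is a minimal sketch, assuming only the standard library; `wait_until` is a hypothetical name, and the HAL predicate is left to the caller so the helper itself does not depend on a running realtime environment.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.01):
    """Poll predicate() until it returns true or timeout seconds pass.

    Returns True if the predicate became true, False on timeout.
    time.monotonic() is used so wall-clock adjustments cannot
    shorten or extend the wait.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the raster test this could be called as `wait_until(lambda: hal.pin_has_writer("test.output"))`, raising an error when it returns False.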

Two small things I noticed in the teardown: except TimeoutExpired looks like it should be subprocess.TimeoutExpired (otherwise a NameError), and input=b"exit" probably wants to be b"exit\n" so halcmd's fgets returns before EOF.
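Both teardown points can be shown with a stand-in child, since the pattern does not depend on halrun itself. In this sketch `stop_interactive` is a hypothetical helper and the child is a tiny Python loop that exits on an "exit" line; only `subprocess.TimeoutExpired` and the trailing newline are the points taken from the comment above.

```python
import subprocess
import sys


def stop_interactive(proc, timeout=5.0):
    """Send an 'exit' line to an interactive child and wait for it.

    The newline terminates the line for line-based readers, and
    subprocess.TimeoutExpired is the exception communicate() raises
    when the child does not finish in time.
    """
    try:
        proc.communicate(input=b"exit\n", timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()          # child ignored the request: force it down
        proc.communicate()   # reap it and close the pipes
    return proc.returncode


# Stand-in for an interactive shell: reads lines until it sees "exit".
CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    if line.strip() == 'exit':\n"
    "        break\n"
)
child = subprocess.Popen([sys.executable, "-c", CHILD], stdin=subprocess.PIPE)
returncode = stop_interactive(child)
```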

@BsAtHome

Contributor

A different solution:

Why not encapsulate the test in a test.hal file:

loadusr -w ./rastertest.py

and rename test into rastertest.py. You also need to remove the os.system('halrun -U') at the bottom because that would interfere with the run_tests' use of halrun.

Hal test files are run using halrun -f test.hal. That should start/stop RT appropriately and not require any hacked external process sequencing in the test.

@hdiethelm hdiethelm force-pushed the rtapi_no_autostart branch from 08465cd to 3e1db68 Compare June 29, 2026 19:48
@hdiethelm

Contributor Author

I think the %% may never come over a pipe: the prompt looks gated on isatty(stdin) (halcmd_main.c:238), so with Popen stdin not a tty, halcmd prints no prompt and readline() would block forever. That might be why it hangs.

Hmm, another "just do it automatically" that creates confusion. Thanks, I would never have expected this.
cat | halcmd -kfs -> no output
halcmd -kfs -> %%

Maybe it is easier to sync through HAL than stdout here, since the test is already a HAL peer? Something like replacing the sleep(0.5) with a poll:

halrun = subprocess.Popen(["halrun", "-Is", "raster.hal"], stdin=subprocess.PIPE)
deadline = time.time() + 10
while time.time() < deadline:
    if hal.component_exists("raster") and hal.pin_has_writer("test.output"):
        break
    time.sleep(0.01)
else:
    raise RuntimeError("raster.hal did not come up within 10s")

pin_has_writer("test.output") should go true once raster.hal runs net output ... => test.output, and start is right after, so it would be deterministic and timeout-bounded without needing a pty. What do you think, would that work for your case?

Two small things I noticed in the teardown: except TimeoutExpired looks like it should be subprocess.TimeoutExpired (otherwise a NameError), and input=b"exit" probably wants to be b"exit\n" so halcmd's fgets returns before EOF.

Thanks, this looks better. Still somewhat of a workaround, but a better one than just waiting. I only test hal.pin_has_writer("test.fraction"); this should be enough, as it is the last pin in the hal file.

A different solution:

loadusr -w ./rastertest.py

This won't work, at least not straight away: raster creates some pins before loading the hal file. These pins are needed in the hal file.

Now looking at the halrun script: it might be an idea to create a proper Python variant for this. It is used many times, with either halrun or halcmd, in probably all possible variants.

tests/raster/test:        halrun = subprocess.Popen(["halrun", "-Is", "raster.hal"], stdin=subprocess.PIPE)
src/emc/usr_intf/stepconf/stepconf.py:        self.halrun = halrun = os.popen("halrun -Is", "w")
src/emc/usr_intf/stepconf/stepconf.py:        self.halrun = halrun = os.popen("halrun -Is", "w")
src/emc/usr_intf/pncconf/pncconf.py:            halrun = os.popen("halrun -Is > /dev/null", "w")
src/emc/usr_intf/pncconf/pncconf.py:        halrun = subprocess.Popen("halrun -I", shell=True,stdin=subprocess.PIPE,stdout=subprocess.PIPE, encoding='utf-8' )
src/emc/usr_intf/pncconf/tests.py:        halrun = os.popen("halrun -Is > /dev/null", "w")
src/emc/usr_intf/pncconf/tests.py:        self.halrun = halrun = os.popen("halrun -Is > /dev/null", "w")
src/emc/usr_intf/pncconf/tests.py:        self.halrun = halrun = os.popen("halrun -Is > /dev/null", "w")
src/emc/usr_intf/pncconf/tests.py:        self.halrun = os.popen("halrun -Is > /dev/null", "w")
src/emc/usr_intf/pncconf/tests.py:            a = os.popen("halrun -U > /dev/null", "w")
src/emc/usr_intf/pncconf/tests.py:            a = os.popen("halrun -U > /dev/null", "w")
...
src/hal/user_comps/gladevcp.py:            
            cmd = ["haltcl", opts.halfile]
        else:
            cmd = ["halcmd", "-f", opts.halfile] 
        res = subprocess.call(cmd, stdout=sys.stdout, stderr=sys.stderr)

src/emc/usr_intf/gscreen/gscreen.py:                res = os.spawnvp(os.P_WAIT, "halcmd", ["halcmd", "-i", inifile, "-f", f])
src/emc/usr_intf/touchy/touchy.py:            res = os.spawnvp(os.P_WAIT, "halcmd", ["halcmd", "-f", "touchy.hal"])
src/emc/usr_intf/touchy/touchy.py:                    res = os.spawnvp(os.P_WAIT, "halcmd", ["halcmd", "-i", inifile, "-f", f])
src/emc/usr_intf/axis/scripts/axis.py:        cmd = "halcmd loadusr -Wn {0} gladevcp -c {0}".format(gladename).split()
src/emc/usr_intf/axis/scripts/axis.py:                    res = os.spawnvp(os.P_WAIT, "halcmd", ["halcmd", "-i", vars.emcini.get(), "-f", f])
src/emc/usr_intf/axis/scripts/axis.py:                res = os.spawnvp(os.P_WAIT, "halcmd", ["halcmd"] + f.split())
src/emc/usr_intf/mdro/mdro.py:                halcmd_args = ["halcmd", "-V", "-i", args.ini, "-f", f]
src/emc/usr_intf/mdro/mdro.py:                print("halcmd", halcmd_args)
src/emc/usr_intf/mdro/mdro.py:                halcmd_args = ["halcmd", "-i", args.ini, "-f", f]
src/emc/usr_intf/mdro/mdro.py:        res = os.spawnvp(os.P_WAIT, "halcmd", halcmd_args)
src/emc/usr_intf/gmoccapy/gmoccapy.py:                res = os.spawnvp(os.P_WAIT, "halcmd", ["halcmd", "-i", inifile, "-f", f])
src/emc/usr_intf/gmoccapy/gmoccapy.py:                res = os.spawnvp(os.P_WAIT, "halcmd", ["halcmd"] + f.split())
src/emc/usr_intf/qtvcp/qtvcp.py:                cmd = ["halcmd", "-f", opts.halfile]
src/emc/usr_intf/qtvcp/qtvcp.py:                    res = os.spawnvp(os.P_WAIT, "halcmd", ["halcmd", "-i",self.inipath,"-f", f])
src/emc/usr_intf/qtvcp/qtvcp.py:                res = os.spawnvp(os.P_WAIT, "halcmd", ["halcmd"] + f.split())
...

Using something like:

run=hal.halrun()
[ret, out] = run.load('raster.hal')
[ret, out] = run.cmd('show')
run.exit()

would look much nicer and also be deterministic. The variants above probably execute the commands later, with no proper way to know when. Just an idea, for sure not for this PR.
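One possible shape for such a wrapper, purely as a sketch: `HalcmdRunner` is a hypothetical class, not an existing LinuxCNC API, and it assumes a realtime environment is already running (it does not start or stop one). Each call runs one halcmd process to completion, so the caller knows the command has finished when the call returns.

```python
import subprocess


class HalcmdRunner:
    """Hypothetical synchronous wrapper around the halcmd executable."""

    def __init__(self, halcmd="halcmd", timeout=10.0):
        self.halcmd = halcmd      # executable name or path (assumption)
        self.timeout = timeout    # seconds before a call is abandoned

    def cmd(self, *args):
        """Run 'halcmd ARGS...' and return (returncode, stdout)."""
        done = subprocess.run(
            [self.halcmd, *args],
            capture_output=True,
            text=True,
            timeout=self.timeout,
        )
        return done.returncode, done.stdout

    def load(self, halfile):
        """Execute a .hal file, as 'halcmd -f FILE' does."""
        return self.cmd("-f", halfile)
```

Usage would then read `ret, out = HalcmdRunner().load("raster.hal")` followed by `ret, out = HalcmdRunner().cmd("show")`. Starting and stopping realtime is deliberately left out here.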

@BsAtHome

Contributor

A different solution:

loadusr -w ./rastertest.py

This won't work, at least not straight away: raster creates some pins before loading the hal file. These pins are needed in the hal file.

But that is not what is happening. The run_tests will execute halrun -f test.hal, which will start RT and then start the rastertest.py script through the loadusr command and wait (-w) for it to finish. Thus, everything is as expected. Sequence of events:

  1. run_tests --> halrun -f test.hal
  2. halrun --> realtime start
  3. halrun --> halcmd test.hal
  4. halcmd --> loadusr -w ./rastertest.py (halcmd waits for it to finish because of -w)
  5. rastertest.py --> build component + pins
  6. rastertest.py --> runs halcmd raster.hal (build nets)
  7. rastertest.py --> runs test
  8. halcmd --> exits now loadusr is done
  9. halrun --> halcmd stop
  10. halrun --> halcmd unload all
  11. halrun --> realtime stop
  12. exit
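The part of that sequence that runs inside rastertest.py could be structured roughly as below. This is a sketch, not the actual test: `run_raster_test` is a hypothetical name, and the component factory and the command runner are passed in so the skeleton can be exercised without HAL. In the real script the factory would be `hal.component`, and pins would be created before `ready()`.

```python
import subprocess


def run_raster_test(make_component, run_halcmd=subprocess.call):
    """Skeleton of the in-script steps: build component, load nets, test.

    Realtime start/stop is intentionally absent: the surrounding
    halrun invocation owns that.
    """
    comp = None
    try:
        comp = make_component("test")   # real script: hal.component("test")
        # ... create the pins that raster.hal connects to ...
        comp.ready()
        if run_halcmd(["halcmd", "-f", "raster.hal"]) != 0:
            return 1                    # building the nets failed
        # ... run the actual test body here ...
        return 0
    finally:
        # Guarded cleanup: comp stays None if creating the component failed.
        if comp is not None:
            comp.exit()
```

The `comp is not None` guard avoids the secondary UnboundLocalError visible in the RTAI traceback earlier in the thread, where `c.exit()` ran although `hal.component("test")` had already failed.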

@hdiethelm

Contributor Author

Thanks, now I got it. Feels also a bit clumsy, use test.hal to only start rastertest.py and inside load the real raster.hal. Pushed.

@BsAtHome

Contributor

Thanks, now I got it. Feels also a bit clumsy, use test.hal to only start rastertest.py and inside load the real raster.hal. Pushed.

You may think it is "clumsy", but it beats the popen hell every morning, day, evening and night.

FWIW, it is intentional how the tests work using a .hal test file. The exact same construct is used in tests/hal-stream and many tests use similar concepts and constructs. The point is that starting/stopping RT is the run_tests script's responsibility and not the test code.

@hdiethelm

Contributor Author

You may think it is "clumsy", but it beats the popen hell every morning, day, evening and night.

FWIW, it is intentional how the tests work using a .hal test file. The exact same construct is used in tests/hal-stream and many tests use similar concepts and constructs. The point is that starting/stopping RT is the run_tests script's responsibility and not the test code.

Agreed. I checked a few tests to find the proper way to do it but without success. Thanks!

BTW: tests/hal-stream has both test.py and test.hal. That might not be too good an idea. It will break if the order in runtests is changed. Just add, for some reason, *.py) at the beginning and it will break:

run_test() {
    testname="$1"
    case "$testname" in
        *.hal) run_without_overruns "$testname" ;;
        *.sh) run_shell_script "$testname" ;;
        *) run_executable "$testname" ;;
    esac
}

I traced the HAL component ID shift with:

diff --git a/src/rtapi/uspace_common.h b/src/rtapi/uspace_common.h
index 4032e7eb2f..7c5935eb06 100644
--- a/src/rtapi/uspace_common.h
+++ b/src/rtapi/uspace_common.h
@@ -329,11 +329,13 @@ int rtapi_init(const char *modname)
         id = uuid_data->uuid;
     rtapi_mutex_give(&uuid_data->mutex);
 
+    rtapi_print_msg(RTAPI_MSG_ERR, "New component %i %s\n", id, modname);
     return id;
 }
 
 int rtapi_exit(int module_id)
 {
+  rtapi_print_msg(RTAPI_MSG_ERR, "Exit component %i\n", module_id);
   rtapi_shmem_delete(uuid_mem_id, module_id);
   return 0;
 }
diff --git a/src/rtapi/uspace_rtapi_main.cc b/src/rtapi/uspace_rtapi_main.cc
index 24a89109e4..906c09b0be 100644
--- a/src/rtapi/uspace_rtapi_main.cc
+++ b/src/rtapi/uspace_rtapi_main.cc
@@ -858,6 +858,8 @@ static bool master_process_socket_command(int fd) {
 static pthread_t main_thread{};
 
 static int master(int fd) {
+    rtapi_print_msg(RTAPI_MSG_ERR, "Master Start\n");
+
     is_master = true;
     main_thread = pthread_self();
     int result;
@@ -876,6 +878,7 @@ static int master(int fd) {
     rtapi_msg_queue.consume_all([](const message_t &m) {
         fputs(m.msg, m.level == RTAPI_MSG_ALL ? stdout : stderr);
     });
+    rtapi_print_msg(RTAPI_MSG_ERR, "Master Exit\n");
     return result;
 }

and
./scripts/halrun ./tests/hal-show/test.hal > /dev/null 2> ~/component_rtapi_no_autostart.txt

It looks like the earlier start of master results in the shift of the IDs.
component_master.txt
component_rtapi_no_autostart.txt

The same number of components are created and deleted. However, because the order changes, and because the new component ID is reset in master but not in rtapi_no_autostart, the numbers change.

grep "New component" component_master.txt | wc -l
47
grep "Exit component" component_master.txt | wc -l
47
grep "New component" component_rtapi_no_autostart.txt | wc -l
47
grep "Exit component" component_rtapi_no_autostart.txt | wc -l
47

Now why the ID is reset as long as rtapi_app is not running but not anymore after is still a mystery to me.

@BsAtHome

Contributor

Now why the ID is reset as long as rtapi_app is not running but not anymore after is still a mystery to me.

Probably because the HAL shared memory segment is removed in between. Every time you detach with hal_exit(), then a rtapi_exit and a rtapi_shmem_delete will be called if the reference count reached zero. When rtapi_app is running, the memory stays attached and the segment is never actually deleted. When you freshly create the segment, then everything is virgin and the counts start from scratch again.

@hdiethelm

Contributor Author

Probably because the HAL shared memory segment is removed in between. Every time you detach with hal_exit(), then a rtapi_exit and a rtapi_shmem_delete will be called if the reference count reached zero. When rtapi_app is running, the memory stays attached and the segment is never actually deleted. When you freshly create the segment, then everything is virgin and the counts start from scratch again.

Thanks, now I found it. The code is this one that really removes the shm from the system:

if(r2 == 0 && d.shm_nattch == 0) {
r2 = shmctl(shmem->id, IPC_RMID, &d);

This uses the shm system reference counter to see if this shm UUID is used in any other running app and if not, removes it.

The reference counter

shmem->count --;
if(shmem->count) return 0;

is only used to check if this shm segment is still used in the current app. shmem_array itself is not in shared memory, only shmem_array[].mem, so shmem_array[].count is not shared between apps.

This explains the behavior:

  • If rtapi_app is started first:
    • d.shm_nattch != 0 and the ID is continuously increased
  • If halrun is started first:
    • d.shm_nattch == 0 after the first halrun command, the ID is reset.
    • rtapi_app starts and the ID is now continuously increased
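The two cases can be replayed with a toy model in Python. It is a simplification under stated assumptions: `ToyShmSystem` and `halcmd_run` are hypothetical, each process attaches exactly once, and the ID counter is modelled as living inside the segment so that it disappears when the last process detaches.

```python
class ToyShmSystem:
    """Toy model of one shared segment holding the next-ID counter."""

    def __init__(self):
        self.nattch = 0       # system-wide attach count (like shm_nattch)
        self.next_id = None   # None means: the segment does not exist

    def attach(self):
        if self.next_id is None:
            self.next_id = 1  # fresh segment, IDs start from scratch
        self.nattch += 1

    def new_component(self):
        cid = self.next_id
        self.next_id += 1
        return cid

    def detach(self):
        self.nattch -= 1
        if self.nattch == 0:
            self.next_id = None   # last user gone: segment is removed


def halcmd_run(system, components=2):
    """One short-lived halcmd: attach, register components, detach."""
    system.attach()
    ids = [system.new_component() for _ in range(components)]
    system.detach()
    return ids


# Case 1: the master is started first and stays attached.
master_first = ToyShmSystem()
master_first.attach()
master_id = master_first.new_component()
ids_master_first = [halcmd_run(master_first), halcmd_run(master_first)]

# Case 2: no master yet, every halcmd is the only user of the segment.
no_master = ToyShmSystem()
ids_no_master = [halcmd_run(no_master), halcmd_run(no_master)]

print(master_id, ids_master_first)
print(ids_no_master)
```

With the master attached the second halcmd continues where the first one stopped, while without it both runs start again at 1, which is the same pattern as in the traces.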

Additional trace:

diff --git a/src/rtapi/uspace_common.h b/src/rtapi/uspace_common.h
index 4032e7eb2f..043e85adae 100644
--- a/src/rtapi/uspace_common.h
+++ b/src/rtapi/uspace_common.h
@@ -201,7 +201,10 @@ int rtapi_shmem_delete(int handle, int module_id)
     return -EINVAL;
 
   shmem->count --;
-  if(shmem->count) return 0;
+  if(shmem->count){
+    rtapi_print_msg(RTAPI_MSG_ERR, "rtapi_shmem_delete id %i keep\n", module_id);
+    return 0;
+  }
 
   /* unmap the shared memory */
   r1 = shmdt(shmem->mem);
@@ -213,8 +216,11 @@ int rtapi_shmem_delete(int handle, int module_id)
 
   if(r2 == 0 && d.shm_nattch == 0) {
       r2 = shmctl(shmem->id, IPC_RMID, &d);
+      rtapi_print_msg(RTAPI_MSG_ERR, "rtapi_shmem_delete id %i delete d.shm_nattch = %li\n", module_id, d.shm_nattch );
       if (r2 != 0)
              rtapi_print_msg(RTAPI_MSG_ERR, "shmctl(%d, IPC_RMID, ...): %s\n", shmem->id, strerror(errno));
+  }else{
+      rtapi_print_msg(RTAPI_MSG_ERR, "rtapi_shmem_delete id %i keep system d.shm_nattch = %li\n", module_id, d.shm_nattch );
   }
 
   /* free the shmem structure */

rtapi_no_autostart:

Master Start
New component 1 HAL_LIB
Note: Using POSIX non-realtime
New component 2 HAL_LIB_1308925
New component 3 HAL_halcmd1308925
rtapi_shmem_delete id 2 keep system d.shm_nattch = 1
Exit component 2
rtapi_shmem_delete id 2 keep
Exit component 3
rtapi_shmem_delete id 3 keep system d.shm_nattch = 1
New component 4 HAL_LIB_1308925
New component 5 HAL_halcmd1308925
New component 6 HAL_threads
New component 7 HAL___thread1
rtapi_shmem_delete id 4 keep system d.shm_nattch = 1
Exit component 4
rtapi_shmem_delete id 4 keep
Exit component 5
rtapi_shmem_delete id 5 keep system d.shm_nattch = 1
New component 8 HAL_LIB_1308925
New component 9 HAL_halcmd1308925
New component 10 HAL_conv_bit_u32

master:

New component 1 HAL_LIB_1308184
New component 2 HAL_halcmd1308184
rtapi_shmem_delete id 1 delete d.shm_nattch = 0
Exit component 1
rtapi_shmem_delete id 1 keep
Exit component 2
rtapi_shmem_delete id 2 delete d.shm_nattch = 0
New component 1 HAL_LIB_1308184
New component 2 HAL_halcmd1308184
Master Start
New component 3 HAL_LIB
Note: Using POSIX non-realtime
New component 4 HAL_threads
New component 5 HAL___thread1
rtapi_shmem_delete id 1 keep system d.shm_nattch = 1
Exit component 1
rtapi_shmem_delete id 1 keep
Exit component 2
rtapi_shmem_delete id 2 keep system d.shm_nattch = 1
New component 6 HAL_LIB_1308184
New component 7 HAL_halcmd1308184
New component 8 HAL_conv_bit_u32

So I would say this is fine.

@hdiethelm

hdiethelm commented Jun 30, 2026

Contributor Author

I tweaked the warning a bit, realtime is normally not in the path. What do you think about this?

WARNING: Deprecated: No master found. Use "realtime start" to start one.
  A master is started automatically.
  If this appears while using halcmd: Use halrun instead.
  halcmd should only be used with an already running realtime environment.
  halrun creates a realtime environment and tears it down at exit.

@grandixximo

Contributor

Already has only one L ;-)

@hdiethelm hdiethelm force-pushed the rtapi_no_autostart branch from ba82cf9 to 5260be5 Compare June 30, 2026 12:49
@hdiethelm

Contributor Author

Already has only one L ;-)

Thanks. For that I even installed a spell checker in my IDE but somehow, it does not complain...

@hdiethelm hdiethelm force-pushed the rtapi_no_autostart branch from 5260be5 to ed1b44f Compare June 30, 2026 13:27
@hdiethelm hdiethelm changed the title WIP: rtapi_app no autostart rtapi_app: Deprecated autostart / add manual start / no autostop Jun 30, 2026
@hdiethelm hdiethelm force-pushed the rtapi_no_autostart branch from ed1b44f to db1ef67 Compare June 30, 2026 18:01
@hdiethelm

Contributor Author

So, the last TBD is crossed out. No need to change anything in halcmd.

halcmd has to do fork/exec anyway. It waits for the component to be ready or for the app to exit, whichever happens first. This is fine because rtapi_app returns only when the load has finished. So the component becoming ready and the exit happen at more or less the same time; no race condition is expected.

To test, I added a sleep in rtapi_app, either before or after the component was loaded. No issues with either variant.

BTW: There is a small visible change due to the removed autostop, I updated #4206 (comment)

Updated main description and title, ready from my side.

@hdiethelm hdiethelm marked this pull request as ready for review June 30, 2026 18:03
Comment thread scripts/realtime.in Outdated
@hdiethelm

Contributor Author

I just formatted the whole thing. I hope I did not overdo it; otherwise I can revert it and make only the minimal change.

@hdiethelm

hdiethelm commented Jul 1, 2026

Contributor Author

...and since I had the rtapi VM up and running anyway, I also fixed and tested the shellcheck issue in realtime.in

@hdiethelm

Contributor Author

BTW: This will conflict with #4107. I suggest merging #4107 first; then I rebase, and then this can be merged too.

@grandixximo

Contributor

Merged, so you can keep working on it ;-)

hdiethelm added 4 commits July 2, 2026 14:52
This gives rtapi_app deterministic behavior. No other changes are needed
except implementing start / stop in realtime, because all apps run:
realtime start at startup
realtime stop at exit

Autostart is still available with a deprecation message to not break
existing setups. But it is done in a different way by forking and then
sending the command over the socket. This allows for easy separation
into master / client in the future.
Because rtapi_app is started before anything else, the expected owner id
in hal-show is increased by two.

The raster test needs to be started with halrun to have realtime started
and stopped properly.
@hdiethelm hdiethelm force-pushed the rtapi_no_autostart branch from 5d15326 to 57cf3f0 Compare July 2, 2026 13:25
@hdiethelm

hdiethelm commented Jul 2, 2026

Contributor Author

Rebased and tested again.

Looks like I did not test latency-histogram. There, realtime start was only active if RTAI is running, so it threw the deprecation warning. Fixed. @grandixximo can you check my change there? I'm not used to Tcl but it seems to work: 57cf3f0

Edit:
Looking at my Tcl change, I probably start realtime in the background. However, starting it without & only works if I add > /dev/null, which is also not what I want. It seems stdout is blocked for whatever reason. Any hints?

Starting in background has the potential for a race condition...

Edit2:
0d39dca should fix this and also simplifies your realtime check.

Any other standalone tools that need testing?

@hdiethelm hdiethelm force-pushed the rtapi_no_autostart branch from 57cf3f0 to 0d39dca Compare July 2, 2026 16:24
@grandixximo

Contributor

The >@stdout 2>@stderr fix looks right to me. I think the bare exec ... start hung because the forked master inherits stdout, so Tcl's capture pipe never sees EOF and waits until the master exits; redirecting to the real channels means Tcl is not reading a pipe, so it returns when realtime start exits. No &, no background race, IMO the cleaner way.
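
The pipe behaviour described above can be reproduced outside Tcl. The following is a minimal sketch in Python, not the Tcl code from the PR: `sh -c "sleep 1 &"` is only a stand-in for a command that forks a long-lived child and then exits, and the one-second sleep is an arbitrary value chosen for the demonstration.

```python
import subprocess
import time

# Stand-in for "realtime start": the shell forks a long-lived background
# child (like the forked master) and then exits immediately itself.
CMD = ["sh", "-c", "sleep 1 &"]


def timed_run(**kwargs):
    """Run CMD and return how long the call blocked, in seconds."""
    t0 = time.monotonic()
    subprocess.run(CMD, check=True, **kwargs)
    return time.monotonic() - t0


# Capturing stdout: the background child inherits the write end of the
# capture pipe, so the reader only sees EOF once that child exits.
captured = timed_run(stdout=subprocess.PIPE)

# Inheriting the caller's stdout (roughly what >@stdout does in Tcl):
# there is no pipe to drain, so the call returns when the shell exits.
inherited = timed_run()

print(f"captured: {captured:.2f}s, inherited: {inherited:.2f}s")
```

The captured run blocks for about as long as the background child lives, while the inherited run returns almost immediately, which matches the hang seen with the bare exec.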

One thing I would double-check: start_master() does listen() then fork() and the parent returns right after, so the child creates the HAL shmem asynchronously. Socket commands are safe (they queue on the listening socket), but loading the Tcl Hal package right after attaches shmem directly, which I think can race the master creating it. Probably benign since init_hal_data() guards double-init, but the uspace path now has no wait where RTAI had after 1000. IMO it would be more robust to gate on a real HAL round-trip (e.g. halcmd show succeeding) before load Hal package, same deterministic idea as the raster poll, and it lets you drop the RTAI 1000ms guess too.

For other standalone tools: I think the ones that start RT on their own are stepconf, pncconf (halrun -Is) and latency-test, so those are worth a check. The GUIs go through the linuxcnc script which already brackets realtime start, and halscope/halmeter only attach to an existing HAL, so I would not expect issues there.

@hdiethelm

Contributor Author

The >@stdout 2>@stderr fix looks right to me. I think the bare exec ... start hung because the forked master inherits stdout, so Tcl's capture pipe never sees EOF and waits until the master exits; redirecting to the real channels means Tcl is not reading a pipe, so it returns when realtime start exits. No &, no background race, IMO the cleaner way.

One thing I would double-check: start_master() does listen() then fork() and the parent returns right after, so the child creates the HAL shmem asynchronously. Socket commands are safe (they queue on the listening socket), but loading the Tcl Hal package right after attaches shmem directly, which I think can race the master creating it. Probably benign since init_hal_data() guards double-init, but the uspace path now has no wait where RTAI had after 1000. IMO it would be more robust to gate on a real HAL round-trip (e.g. halcmd show succeeding) before load Hal package, same deterministic idea as the raster poll, and it lets you drop the RTAI 1000ms guess too.

Yes, exactly. The master can start slightly delayed after realtime start. Does TCL load the HAL library directly?

The paths that could race are:

  • hal_init() -> rtapi_init() -> rtapi_shmem_new()
  • hal_init() -> rtapi_init() -> uuid_data->uuid++
  • hal_init() -> rtapi_shmem_new()
  • hal_init() -> init_hal_data()

But as it looks to me, it should be safe. shmget() is hopefully race-safe inside the kernel. Everything else uses rtapi_mutex around access to the shared memory. It looks to me like whoever first reaches init_hal_data() after taking the mutex will do the initialization. Everyone else won't.
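
The "first one through the mutex initializes, everyone else skips" pattern can be sketched as follows. This is an illustration in Python with threads, not the HAL code: `lock` stands in for rtapi_mutex, the `shared` dict stands in for the HAL shared memory, and `init_hal_data_sketch` is a hypothetical name.

```python
import threading

lock = threading.Lock()  # stands in for the rtapi mutex
shared = {"initialized": False, "init_count": 0}  # stands in for HAL shmem


def init_hal_data_sketch():
    """Initialize the shared state once, however many callers race."""
    with lock:
        if shared["initialized"]:
            return False  # someone else already did the work
        shared["init_count"] += 1
        shared["initialized"] = True
        return True


results = []
results_lock = threading.Lock()


def worker():
    did_init = init_hal_data_sketch()
    with results_lock:
        results.append(did_init)


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("initializations:", shared["init_count"])
print("callers that initialized:", results.count(True))
```

The sketch only shows that the check and the initialization sit under the same lock; whether every access path in the real code does the same is what would still need checking.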

Options:

  • Try to trigger a race and test if it is really safe
  • Add a ping command / reply. After start_master(), I could use run_slave_cmd() to send a ping to the master to make sure it is up and running before returning. I guess this is the only way right now to wait until it is really up?

Every following rtapi_app call will take the slave path, where a response is awaited anyway. So if there is no race in the above paths or the rtapi_app socket, it should be fine. rtapi_app uses bind() to check if the master is up. I think this should also be atomic; otherwise, we would have races all day long if two apps could bind the same socket... ;-)
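
A minimal sketch of using bind() as the "is there already a master" check, assuming a Unix domain socket. The socket path and the helper name `try_become_master` are hypothetical, and the address type rtapi_app really uses is not shown here.

```python
import errno
import os
import socket
import tempfile


def try_become_master(path):
    """Return a bound, listening socket, or None if the address is taken."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None  # somebody else already owns the address
        raise
    sock.listen(1)
    return sock


tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "master.sock")  # hypothetical socket path

first = try_become_master(path)   # wins the address
second = try_become_master(path)  # finds it taken

print("first is master:", first is not None)
print("second is master:", second is not None)

first.close()
os.unlink(path)
os.rmdir(tmpdir)
```

Only one caller can own the address, so bind() decides who is master. It says nothing about whether that master has finished initializing.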

I don't understand why realtime start is run in the background for the RTAI case. Might be the same pipe issue, with that as a quick-and-dirty workaround? ;-) Or a workaround for kernel modules taking some time to be up? I need to check what happens when I remove the wait.

What do you mean by "same deterministic idea as the raster poll"? This change, #4213?

halcmd show won't help: It also does hal_init(). If hal_init() races, bad luck. And it doesn't wait for the realtime part to be up, so it could still race after.

For other standalone tools: I think the ones that start RT on their own are stepconf, pncconf (halrun -Is) and latency-test, so those are worth a check. The GUIs go through the linuxcnc script which already brackets realtime start, and halscope/halmeter only attach to an existing HAL, so I would not expect issues there.

halrun -Is is fine: It starts rtapi_app. However, before or after my change, os.popen("halrun -Is", "w") is not deterministic. It will execute the start and the commands some time after the write() calls. That would be a reason to create a proper halrun python library, where after halrun.write(), you are sure that the command has been executed.

  • stepconf: Fine, tested. It starts but then doesn't find a parport (I don't have one)
  • pncconf: Needs a Mesa card. I have one; however, it is in my machine and I don't want to break it by accidentally setting a wrong pin.
  • latency-test: Fine, I always use that one for quick tests if nothing broke.
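
To illustrate the halrun library idea mentioned above, where a write only returns once the command has been executed, here is a rough sketch. It is not a halrun API: `sh` is used as a stand-in interpreter, and the `SyncShell` class and the sentinel string are hypothetical. Whether halrun could acknowledge commands in a similar way is an open assumption.

```python
import subprocess

SENTINEL = "__CMD_DONE__"  # hypothetical marker, not part of halrun


class SyncShell:
    """Pipe commands to a child and block until each one has executed."""

    def __init__(self, argv):
        self.proc = subprocess.Popen(
            argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
        )

    def run(self, command):
        # Send the command followed by an echo of the sentinel; once the
        # sentinel comes back, the command before it must have finished.
        self.proc.stdin.write(f"{command}\necho {SENTINEL}\n")
        self.proc.stdin.flush()
        output = []
        while True:
            line = self.proc.stdout.readline()
            if not line:
                raise RuntimeError("child exited before acknowledging")
            if line.strip() == SENTINEL:
                return output
            output.append(line.rstrip("\n"))

    def close(self):
        self.proc.stdin.close()
        self.proc.stdout.close()
        return self.proc.wait()


shell = SyncShell(["sh"])  # sh is a stand-in; this is not halrun
first = shell.run("echo hello")
second = shell.run("echo one; echo two")
exit_code = shell.close()
print(first, second, exit_code)
```

In contrast to os.popen(..., "w"), each run() call here returns only after the child has acknowledged the command, so the caller knows where execution stands.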

@grandixximo

Contributor

You are right about halcmd show, my suggestion was wrong: it does its own hal_init() and does not round-trip the master, so it neither avoids the race nor proves the master is up. Scratch that.

And yes, the Tcl side loads the HAL library directly (hal_init() attaches shmem), which is exactly why the socket-queue safety does not cover it.

I think your ping idea is the right one, and IMO the clean place for it is inside start_master(): have the parent send a ping through run_slave_cmd() and only return once the child master answers. The master only reaches its accept loop after do_load_cmd("hal_lib") + App(), so a successful ping proves it is fully up. Then realtime start blocks until ready for every caller, not just latency-histogram, and you can drop the fixed waits. That fixes it at the source rather than each tool guessing.
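
The ping barrier could look roughly like this. It is a sketch in Python rather than the C++ of rtapi_app, with a hypothetical socket path, a made-up ping/pong payload, and a sleep standing in for the master's initialization. The child here exits after one ping, whereas a real master would keep serving.

```python
import os
import socket
import tempfile
import time


def start_master_sketch(path, init_delay=0.2, timeout=5.0):
    """listen(), fork(), then block in the parent until the child answers."""
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(1)

    pid = os.fork()
    if pid == 0:
        # Child ("master"): finish initialization before serving anything.
        status = 1
        try:
            time.sleep(init_delay)  # stands in for loading hal_lib
            conn, _ = server.accept()
            with conn:
                if conn.recv(16) == b"ping":
                    conn.sendall(b"pong")
                    status = 0
        finally:
            os._exit(status)  # never fall back into the caller's code

    # Parent: connect() succeeds at once because the socket is already
    # listening; the reply only arrives after the child has initialized.
    server.close()
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.settimeout(timeout)
        client.connect(path)
        client.sendall(b"ping")
        reply = client.recv(16)
    os.waitpid(pid, 0)  # the sketch's master exits after one ping
    return reply == b"pong"


tmpdir = tempfile.mkdtemp()
sock_path = os.path.join(tmpdir, "master.sock")  # hypothetical path
t0 = time.monotonic()
ready = start_master_sketch(sock_path)
elapsed = time.monotonic() - t0
os.unlink(sock_path)
os.rmdir(tmpdir)
print("master ready:", ready, f"after {elapsed:.2f}s")
```

The parent returns only after the reply, so the time spent in the call covers the child's initialization and no fixed wait is needed afterwards.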

On "same deterministic idea as the raster poll": I meant the general poll-until-ready approach, wait for an observable ready state instead of a fixed sleep, not #4213 specifically. The in-start_master ping is the same principle but better placed.

On the RTAI & + after 1000: I am not sure, my guess is it is kernel-module insmod settling rather than the pipe issue, and RTAI does not use the rtapi_app master socket, so the ping barrier would be uspace-only and RTAI's wait stays a separate concern. Worth checking what actually happens when you drop it, as you said.

Agreed on os.popen("halrun -Is", "w") being non-deterministic (commands run some time after the writes); a proper halrun python wrapper as you sketched earlier would solve that class cleanly. Good that stepconf and latency-test check out.

It was only active for RTAI.
Do realtime start in foreground for RTAI.

Additionally, simplify realtime verify.
@hdiethelm hdiethelm force-pushed the rtapi_no_autostart branch from 0d39dca to 3a78036 Compare July 3, 2026 11:33
@hdiethelm

Contributor Author

I think your ping idea is the right one, and IMO the clean place for it is inside start_master(): have the parent send a ping through run_slave_cmd() and only return once the child master answers. The master only reaches its accept loop after do_load_cmd("hal_lib") + App(), so a successful ping proves it is fully up. Then realtime start blocks until ready for every caller, not just latency-histogram, and you can drop the fixed waits. That fixes it at the source rather than each tool guessing.

Will do, deterministic is always nice... ;-)

On "same deterministic idea as the raster poll": I meant the general poll-until-ready approach, wait for an observable ready state instead of a fixed sleep, not #4213 specifically. The in-start_master ping is the same principle but better placed.

Can you point me at code doing this? Or would this be another approach? I don't know of a way right now to check if realtime is up. Grepping for a process named rtapi_app in realtime status also won't do the job.

On the RTAI & + after 1000: I am not sure, my guess is it is kernel-module insmod settling rather than the pipe issue, and RTAI does not use the rtapi_app master socket, so the ping barrier would be uspace-only and RTAI's wait stays a separate concern. Worth checking what actually happens when you drop it, as you said.

Removing the wait and:
Using & -> Error could not open shared memory
Using >@stdout 2>@stderr -> All works fine
Using nothing: -> All works fine

Not really knowing what the wait is for, I decided to do exec $::LH(realtime) start >@stdout 2>@stderr and leave the wait in. Pipe-safe and non-breaking. realtime start in the background is for sure bad.

Agreed on os.popen("halrun -Is", "w") being non-deterministic (commands run some time after the writes); a proper halrun python wrapper as you sketched earlier would solve that class cleanly. Good that stepconf and latency-test check out.

There are probably more than 100 locations using halrun / halcmd with pipes: #4206 (comment) I will create an issue to solve that properly if someone has the motivation to do it... ;-)

@grandixximo

Contributor

The poll-until-ready code I had in mind is the raster snippet earlier in this thread: loop on hal.pin_has_writer("test.fraction") / hal.component_exists("raster") until true with a deadline. But that checks a specific pin/component, not "is realtime up" generically, so it does not directly answer your case.
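
For reference, the general poll-until-ready shape is roughly the following. This is a generic sketch, not the raster snippet itself: `wait_until` and `fake_ready` are hypothetical names, and the predicate is a stand-in so the example runs without HAL.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.05):
    """Poll predicate() until it returns true or the deadline passes."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in readiness check that turns true on the third call. In the
# raster test the predicate would be something like
# lambda: hal.component_exists("raster"), as quoted in this thread.
calls = {"n": 0}


def fake_ready():
    calls["n"] += 1
    return calls["n"] >= 3


ok = wait_until(fake_ready, timeout=1.0, interval=0.01)
timed_out = wait_until(lambda: False, timeout=0.05, interval=0.01)
print(ok, calls["n"], timed_out)
```

The deadline keeps a missing component from hanging the caller forever, but the caller still has to pick a timeout, which is the part a blocking realtime start would make unnecessary.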

For "is the master fully up" I think you are right that there is no clean existing check: both the process grep and the bind() test only tell you a master exists, not that it finished do_load_cmd("hal_lib") + App(). That gap is exactly what your ping closes. So IMO once the ping is in start_master(), callers need no separate readiness check at all, realtime start returning is the ready signal. That is better than any poll because there is nothing to race.

If you want a reusable "is RT up" beyond this, I think it would be nice to expose the same ping as a command (e.g. have realtime status round-trip a ping to the master instead of grepping). Then there is finally a deterministic RT-up check, but that can be its own follow-up, not needed for this PR.

On latency-histogram: >@stdout 2>@stderr and leaving the wait in sounds like the safe call, agreed the background & is bad. Once the start_master ping lands, the uspace wait becomes redundant and you could drop it then; the RTAI wait I would leave since we do not know it is only the pipe.

An issue for the ~100 halrun/halcmd pipe sites sounds right to me, that is clearly its own effort.

@BsAtHome

BsAtHome commented Jul 3, 2026

Contributor

Could it possibly be helpful to add a parameter realtime.running that defaults to false and is set/cleared by starting/stopping the realtime environment? If the param is missing, then RT is definitely not running. Otherwise its value will state your case.

@hdiethelm

Contributor Author

Could it possibly be helpful to add a parameter realtime.running that defaults to false and is set/cleared by starting/stopping the realtime environment? If the param is missing, then RT is definitely not running. Otherwise its value will state your case.

There is no pin/parameter without a component. I already implemented it and discarded it again here: #4132 (comment)

I would have to create a persistent component with a pin or parameter. You cannot get rid of it. It will annoy tests and maybe also people... ;-)

Now that you say it: hal.get_realtime_type() can be indirectly used to check if real time is running. REALTIME_TYPE_UNINITIALIZED means it is not running.

Still, I think just making sure that after realtime start, rtapi_app is initialized and up is the better solution. Should only delay a few ms at worst and doesn't need polling with timeout. There is already a timeout in rtapi_app.

If you do rtapi load comp, rtapi_app master will be auto started and be up and running afterwards. So only the start command needs a bit of change.

@hdiethelm

hdiethelm commented Jul 3, 2026

Contributor Author

On latency-histogram: >@stdout 2>@stderr and leaving the wait in sounds like the safe call, agreed the background & is bad. Once the start_master ping lands, the uspace wait becomes redundant and you could drop it then; the RTAI wait I would leave since we do not know it is only the pipe.

The only wait is for RTAI where I can not have a ping. RTAI doesn't use rtapi_app.

An issue for the ~100 halrun/halcmd pipe sites sounds right to me, that is clearly its own effort.

Yes, another long discussion first to have a nice api... ;-) #4230
