Created on 2023-05-18.16:40:33 by falsifian, last changed 2023-10-19.22:35:46 by falsifian.
msg23268 (view) |
Author: falsifian |
Date: 2023-05-18.16:40:32 |
|
1. Summarise the issue (what were doing, what went wrong?)
Run "darcs log". Quit the pager (less) without scrolling to the end. less quits but darcs doesn't exit. CTRL-C doesn't work.
I could have sworn I reported this a few years ago, but I can't find the issue. http://bugs.darcs.net/issue1278 looks superficially similar, but I can't reproduce that one.
2. What behaviour were you expecting instead?
darcs should quit once less does.
3. What darcs version are you using? (Try: darcs --exact-version)
From OpenBSD ports; output of darcs --exact-version:
darcs compiled on May 8 2023, at 07:59:54
Weak Hash: not available
Context:
[TAG 2.16.5
Ganesh Sittampalam <ganesh@earth.li>**20220220083531
Ignore-this: 51417cb5dd488dc0d962ffb051f63ee1
]
4. What operating system are you running?
OpenBSD-current on amd64
|
msg23269 (view) |
Author: bfrk |
Date: 2023-05-25.09:46:30 |
|
This seems to ba OS specific, it doesn't happen on Linux.
The code is in src/Darcs/UI/External.hs, viewDocWith and getViewer,
the actual process management is in pipeDocInternal where it says
rval <- waitForProcess pid
1) What is your pager and with what arguments is it run? (use ps
while the pager is still active). Depending on environment settings
and availability, darcs will invoke different pagers.
2) How do you quit the pager exactly?
3) Can you still see the pager with ps after it has been "quit",
i.e. is the process perhaps still running somehow?
The fact that Ctrl-C doesn't work even after quitting the pager
strongly suggests that the pager is still running, since we use the
delegate_ctlc feature of System.Process, that is, the parent process
(darcs) is ignoring SIGINT untill the child process is finished.
This is so that pagers which have no way to quit them other via
Ctrl-C can be quit without automatically quitting darcs itself, too.
|
msg23270 (view) |
Author: falsifian |
Date: 2023-05-25.21:30:38 |
|
On Thu, May 25, 2023 at 09:46:34AM +0000, Ben Franksen wrote:
>
> Ben Franksen <ben.franksen@online.de> added the comment:
>
> This seems to ba OS specific, it doesn't happen on Linux.
>
> The code is in src/Darcs/UI/External.hs, viewDocWith and getViewer,
> the actual process management is in pipeDocInternal where it says
>
> rval <- waitForProcess pid
>
> 1) What is your pager and with what arguments is it run? (use ps
> while the pager is still active). Depending on environment settings
> and availability, darcs will invoke different pagers.
less with no arguments. Here is an excerpt of pstree output before I
(attempt to) quit less:
| |-+= 88820 falsifia -ksh (ksh)
| | \-+= 97789 falsifia darcs log
| | \--- 35117 falsifia less
> 2) How do you quit the pager exactly?
I press "q".
> 3) Can you still see the pager with ps after it has been "quit",
> i.e. is the process perhaps still running somehow?
It's a zombie. After pressing q:
moth ~ $ ps -lp 35117
UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND
1000 35117 97789 0 -22 0 0 0 - Z+p pp 0:00.00 (less)
Note this doesn't happen if I scroll to the end of the output before
quitting less. Is the line of code you quoted above reached in that
case?
I could gather a trace of Darcs's system calls (with ktrace, like Linux
strace) if that would help, e.g. to see what it's stuck on.
|
msg23271 (view) |
Author: bfrk |
Date: 2023-05-26.06:49:15 |
|
> It's a zombie.
This comfirms my suspicion.
> Note this doesn't happen if I scroll to the end of the output before
> quitting less. Is the line of code you quoted above reached in that
> case?
We could find out: run the same darcs command with --debug and redirect
stderr to a file, then post the last few lines of that file.
The fact that it works fine if you scroll to the end suggests that the
problem is related to the way we transfer the data to the pager. It looks to
me as if darcs is stuck trying to send more bytes to the pipe, while the
receiving end (the pager) has already terminated. So I think we don't even
reach the line where the pipe is closed but are stuck at
PipeToOther printers -> hPutDocWith printers i inp
A process that terminates should close all its open handles. The question is:
what happens at the other end, i.e. the parent process that tries to write to
the pipe?
> I could gather a trace of Darcs's system calls (with ktrace, like Linux
> strace) if that would help, e.g. to see what it's stuck on.
Before we go that far down, I suggest we exhaust the "printf debuggging"
method. I suggest you clone http://darcs.net/screened, but it should also be
possible to start from http://darcs.net/releases/branch-2.16 (which has the
advantage that you don't have to rebuild all the dependencies). The idea is
to add extra debugMessage calls in pipeDocInternal to narrow down where it
hangs.
|
msg23272 (view) |
Author: bfrk |
Date: 2023-05-26.07:47:40 |
|
I just realized that what we do here is incorrect even on Linux. In
src/Darcs/Util/SignalHandler.hs it says
| s == sigPIPE = exitWith $ ExitFailure 1
And indeed, when I run `darcs log` and then hit 'q' to terminate the
pager, the exit code is 1, which is wrong. I think that, at least when
we pipe our output through a pager, we should ignore SIGPIPE and handle
(i.e catch and ignore) the EPIPE error when the write fails. Looking
closer at the code in src/Darcs/UI/External.hs, I see that there *is*
code to handle EPIPE in the function ortryrunning, which among other
things also handles the ResourceVanished exception ... which (if I
understand things correctly) should not be raised because we do have a
handler for SIGPIPE.
The signal handling code is very old and quite messy, so I may have
overlooked something. And it doesn't explain why the behavior differs
between Linux and OpenBSD.
|
msg23273 (view) |
Author: bfrk |
Date: 2023-05-26.18:13:37 |
|
From the mailing-list, for reference, by gpiero:
I can reproduce this with a default fresh install of OpenBSD 7.3 using
darcs-2.16.5p1 from packages.
The issue, as confirmed by ktrace, is that the darcs process doesn't get
a SIGPIPE. I'm not knowledgeable enough about OpenBSD, but I suspect
this is a form of the old unflushed-buffered-output problem. This
doesn't happen if you scroll all the way down in less or if you execute
`darcs log` in a repo with a short enough log because in that case darcs
has already emitted the whole output. OTOH, if darcs doesn't flush the
pending output it doesn't receive a SIGPIPE and it just waits idle.
How and when the buffered output is 'spontaneously' flushed, or if it's
buffered at all, is OS-specific, so it explains why the issue only
appears to be present in OpenBSD.
Force-flushing the output (or maybe using an unbuffered output) should
solve this and confirm my suspects. Unfortunately, I don't know how to
do it in Haskell, but Ben/Ganesh surely do.
|
msg23274 (view) |
Author: bfrk |
Date: 2023-05-26.18:17:43 |
|
[My reply to the list:]
Easy enough, see attached patch which I recorded against 2.16.5.
Please clone http://darcs.net/releases/branch-2.16, cd into the new
directory, apply the patch, and then run
cabal update
cabal run darcs -- log
to see if the issue is indeed fixed or not. If yes, it would be nice
if you could afterwards obliterate the patch and try again, to
exclude the possibility that the problem was caused by an outdated
dependency (such as a bug in the process packages that has since
been fixed).
[And from gpiero:]
James, can you test Ben's patch?
The OpenBSD instance I used for testing is a 512MB VM on my
resource-starved laptop and there's no way I can recompile darcs
over there.
Attachments
|
msg23275 (view) |
Author: bfrk |
Date: 2023-05-27.18:02:55 |
|
I am pretty sure now that this patch will not fix the problem. We close
the write-end of the pipe immediately after the write, which should do an
implicit flush.
You (gpiero) say
> OTOH, if darcs doesn't flush the
> pending output it doesn't receive a SIGPIPE and it just waits idle.
I don't understand what it is that darcs is waiting for. Why is a signal
needed at all? What is "the old unflushed-buffered-output problem"?
In your ktrace, what is the last system call that darcs does (before it
hangs)?
|
msg23276 (view) |
Author: gpiero |
Date: 2023-05-28.08:33:53 |
|
* [Sat, May 27, 2023 at 06:02:55PM +0000] Ben Franksen:
>I am pretty sure now that this patch will not fix the problem. We close
>the write-end of the pipe immediately after the write, which should do an
>implicit flush.
Yeah, I was sceptic too, but for another reason: I don't think the
program reaches that part of code and the flush should have been placed
more in depth. *But*, I've investigated a little more and darcs seems
stuck inside hPutDocWith and, following the code, hPutBuf (or hPutBuf',
I don't recall) *does* call flush. So it doesn't seem related to a lack
of buffer flushing. Sorry I'm a bit vague here but I had to remove that
VM and... well, I foolishly kept my notes INSIDE the VM (oops).
>You (gpiero) say
>> OTOH, if darcs doesn't flush the
>> pending output it doesn't receive a SIGPIPE and it just waits idle.
>
>I don't understand what it is that darcs is waiting for. Why is a signal
>needed at all? What is "the old unflushed-buffered-output problem"?
I/O is usually buffered: ttys are typically line-buffered (it's flushed
on newlines) and pipes are typically block-buffered (it's flushed when
it's full). You can also flush the buffer explicitly or, implicitly, by
closing the fd.
For the sake of simplicity, here an example in Perl (it's not
language-dependant, but some languages or libraries can call flush under
the hood):
$ perl -E 'print "foobar"; sleep 10;'
This writes the string "foobar" *to the buffer*, sleeps for 10 seconds,
then exits and in doing that closes the stdout and flushes the buffer.
Only then you'll see "foobar" on the screen. Note that 'sleep 10' is
just a stub: you can replace it with days long intensive computation, an
infinite loop or a deadlock. In all cases, you'll only see "foobar" when
the buffer is flushed to stdout.
If you don't want to wait 10 seconds for that important piece of
information, you can append a newline to the string (because it's
line-buffered):
$ perl -E 'say "foobar"; sleep 10;'
or explicitly flush the buffer:
$ perl -E 'print "foobar"; STDOUT->flush; sleep 10;'
Well, in Perl you can also set autoflush, but that's not relevant here.
>In your ktrace, what is the last system call that darcs does (before it
>hangs)?
Unfortunately, the ktrace.out was gone with the VM (oops 2). Will redo
in the next days.
Ciao,
Gian Piero.
|
msg23280 (view) |
Author: falsifian |
Date: 2023-05-28.22:18:44 |
|
> I am pretty sure now that this patch will not fix the problem. We close
> the write-end of the pipe immediately after the write, which should do an
> implicit flush.
For the record, it indeed doesn't.
I wasn't able to build branch-2.16 because my base version is too high,
but I applied what ought to be the same change on top of screened, and
the problem is still there.
For reference, I've attached my version of the patch (which I redid
because yours depends on a tag not in screened) and the output of darcs
log --context after recording that.
Wit that change applied, and running "cabal run darcs -- log" (having
earlier today run "cabal update"), I still see the behaviour I
originally reported.
> You (gpiero) say
> > OTOH, if darcs doesn't flush the
> > pending output it doesn't receive a SIGPIPE and it just waits idle.
>
> I don't understand what it is that darcs is waiting for. Why is a signal
> needed at all? What is "the old unflushed-buffered-output problem"?
>
> In your ktrace, what is the last system call that darcs does (before it
> hangs)?
It's poll(). Here's an exerpt of a system call trace of just the darcs
process (output of kdump -T) leading up to and including the SIGTERM I
eventually sent it.
34354 darcs 1685311642.583002 CALL read(10,0x2edbb57d010,0x1000)
34354 darcs 1685311642.583012 GIO fd 10 read 1 bytes
"\M^?"
34354 darcs 1685311642.583017 RET read 1
34354 darcs 1685311642.583032 CALL poll(0x2edba704a10,2,INFTIM)
34354 darcs 1685311655.373252 PSIG SIGTERM caught handler=0x2eaeb32ed50 mask=0<>
34354 darcs 1685311655.373285 RET futex -1 errno 4 Interrupted system call
Attachments
|
msg23283 (view) |
Author: falsifian |
Date: 2023-05-28.22:55:23 |
|
In case it's helpful:
As an experiment, I tried piping the output of the following simple
program to less. I wrote it this way because after skimming the code
not-very-carefully, I thought "darcs log" amounted to calling
Data.ByteString.hPut a bunch of times (after forking off the pager).
module Main where
import qualified Data.ByteString as B
import Data.String (fromString)
import System.IO (stdout)
main :: IO ()
main = B.hPut stdout (fromString "Hello.\n") >> main
The behaviour is similar to darcs log, but not quite the same. I press
"q" to exit less, and then:
- The (above) program doesn't exit, like with darcs.
- Unlike darcs log, the system call it's stuck on is select. With darcs
log it was poll. So, maybe I'm not successfully reproducing what darcs
log is doing?
- Other differences probably explained by the fact that I did the
pipe-to-less from the shell instead of having the Haskell program call
less: (a) the less process is gone, not a zombie; (b) pressing ctrl-C
(after having quit less) successfully terminates the program.
|
msg23286 (view) |
Author: bfrk |
Date: 2023-05-29.12:35:20 |
|
The select vs. poll is most probably due to a different version of
ghc or perhaps different compiler flags. These calls are used by the
GHC runtime system to multiplex IO between threads.
It is interesting that you can reproduce the gist of the problem
with that simple Haskell program and pipe composition does by the
shell. This is a strong hint that what we have here is a bug in the
GHC runtime on BSD. Before we open an issue on the GHC bug tracker,
let's do a few more experiments:
Does it make a difference whether you have "\n" in the string or
not?
Does the same happen if you replace less-then-quit with /bin/true?
Also, which ghc version are you using? Can you test with the latest
ghc (9.6.2)?
|
msg23287 (view) |
Author: bfrk |
Date: 2023-05-29.12:36:20 |
|
> with that simple Haskell program and pipe composition does by the
> shell. This is a strong hint that what we have here is a bug in the
s/does/done/
|
msg23288 (view) |
Author: falsifian |
Date: 2023-05-29.14:22:43 |
|
Thanks for the suggestions.
> Does it make a difference whether you have "\n" in the string or
> not?
Only that there is a significant pause before less shows its ":" prompt
on the bottom line of the terminal; I guess less is waiting for the
input to reach some line length limit. After that, I press "q", and as
usual my program doesn't exit.
> Does the same happen if you replace less-then-quit with /bin/true?
No,
./f | /usr/bin/true
exits immediately (./f is my compiled program).
> Also, which ghc version are you using? Can you test with the latest
> ghc (9.6.2)?
I have 9.2.7. I'll try compiling the latest ghc, but it may be a while
before I find the time.
|
msg23290 (view) |
Author: gpiero |
Date: 2023-05-29.17:16:04 |
|
For the record, the minimal test case is really minimal:
$ cat <<EOF >test.hs
module Main where
main :: IO ()
main = putStrLn "foobar" >> main
EOF
$ ghc test
$ ./test | less
OpenBSD 7.3 with ghc-9.2.7p1 from packages.
|
msg23291 (view) |
Author: bfrk |
Date: 2023-05-29.18:40:47 |
|
If it fails with this extremely simple test program and a recent
enough ghc (9.2.8 is the version recommended by ghcup) then I think
you should file an bug on the ghc tracker. Sorry for delegating this
to you but I am not in a good position to respond to clarification
requests etc.
It is a bit unsettling that this does not happen with 'true', only
with 'less'. Perhaps the error only manifests if the first write
succeeds. You could try
./test | read x
The read here should terminate after reading one line. What I am
trying to do here is to eliminate less as a possible culprit.
|
msg23292 (view) |
Author: gpiero |
Date: 2023-05-30.10:01:19 |
|
* [Mon, May 29, 2023 at 06:40:47PM +0000] Ben Franksen:
>If it fails with this extremely simple test program and a recent
>enough ghc (9.2.8 is the version recommended by ghcup) then I think
>you should file an bug on the ghc tracker.
James, is it something you can do? I'm not knowledgeable enough about
OpenBSD nor GHC/Haskell and I'll have a hard time compiling code on that
VM.
For completeness, in case it's not already solved in recent releases,
it's something to do with the feedback mechanism of a full buffer in
pipes. It doesn't happen with /bin/true, read, etc. because they don't
saturate the buffer. It happens with less, more and similar.
To reproduce:
$ cat <<EOF >test.hs
module Main where
main :: IO ()
main = putStrLn "foobar" >> main
EOF
$ ghc test
$ ./test | perl -E '$SIG{ALRM} = sub { die "exiting on SIGALRM\n" };
alarm 3; while (<>) { sleep .5; print $_ }'
You could need to adjust the arguments to alarm() and sleep() in order
to be sure the pipe is clogged on your system. Removing the sleep() is
sufficient for not having the issue.
|
msg23308 (view) |
Author: falsifian |
Date: 2023-06-02.02:56:37 |
|
On Mon, May 29, 2023 at 06:40:47PM +0000, Ben Franksen wrote:
...
> If it fails with this extremely simple test program and a recent
> enough ghc (9.2.8 is the version recommended by ghcup) then I think
> you should file an bug on the ghc tracker. Sorry for delegating this
> to you but I am not in a good position to respond to clarification
> requests etc.
I am happy to help with this; it will just take a while before I can get
to it.
On Tue, May 30, 2023 at 10:01:19AM +0000, Gian Piero Carrubba wrote:
>
> Gian Piero Carrubba <gpiero@rm-rf.it> added the comment:
>
> * [Mon, May 29, 2023 at 06:40:47PM +0000] Ben Franksen:
> >If it fails with this extremely simple test program and a recent
> >enough ghc (9.2.8 is the version recommended by ghcup) then I think
> >you should file an bug on the ghc tracker.
>
> James, is it something you can do? I'm not knowledgeable enough about
> OpenBSD nor GHC/Haskell and I'll have a hard time compiling code on that
> VM.
Yes, I expect I'll be able to figure it out, unless recent changes in
ghc have created more work for the OpenBSD port maintainer.
|
msg23586 (view) |
Author: bfrk |
Date: 2023-07-16.11:57:49 |
|
falsifian wrote:
> I am happy to help with this; it will just take a while before
> I can get to it.
Just a reminder, in case you forgot ;-)
|
msg23629 (view) |
Author: falsifian |
Date: 2023-07-22.14:16:37 |
|
> Just a reminder, in case you forgot ;-)
It's still on my list!
|
msg23724 (view) |
Author: falsifian |
Date: 2023-10-19.22:35:44 |
|
>It's still on my list!
Small update: I tried naïvely updating the OpenBSD port to 9.8.1 but it looks like it will be some work (more than I have time for right now) because some patches no longer apply. I think I will wait until the next time someone updates the OpenBSD port.
|
|
Date |
User |
Action |
Args |
2023-05-18 16:40:33 | falsifian | create | |
2023-05-25 09:46:34 | bfrk | set | messages:
+ msg23269 |
2023-05-25 21:30:42 | falsifian | set | messages:
+ msg23270 |
2023-05-26 06:49:17 | bfrk | set | messages:
+ msg23271 |
2023-05-26 07:47:40 | bfrk | set | messages:
+ msg23272 |
2023-05-26 18:13:37 | bfrk | set | messages:
+ msg23273 |
2023-05-26 18:17:43 | bfrk | set | files:
+ wip-resolve-issue2702.dpatch messages:
+ msg23274 |
2023-05-27 18:02:55 | bfrk | set | messages:
+ msg23275 |
2023-05-28 08:33:53 | gpiero | set | messages:
+ msg23276 |
2023-05-28 22:18:44 | falsifian | set | files:
+ context, wip.dpatch messages:
+ msg23280 |
2023-05-28 22:55:23 | falsifian | set | messages:
+ msg23283 |
2023-05-29 12:35:21 | bfrk | set | messages:
+ msg23286 |
2023-05-29 12:36:20 | bfrk | set | messages:
+ msg23287 |
2023-05-29 14:22:43 | falsifian | set | messages:
+ msg23288 |
2023-05-29 17:16:05 | gpiero | set | messages:
+ msg23290 |
2023-05-29 18:40:47 | bfrk | set | messages:
+ msg23291 |
2023-05-30 10:01:19 | gpiero | set | messages:
+ msg23292 |
2023-06-02 02:56:39 | falsifian | set | messages:
+ msg23308 |
2023-07-16 11:57:49 | bfrk | set | messages:
+ msg23586 |
2023-07-22 14:16:38 | falsifian | set | messages:
+ msg23629 |
2023-10-19 22:35:46 | falsifian | set | messages:
+ msg23724 |
|