Issue 2703: darcs log hangs after quitting pager

Title	darcs log hangs after quitting pager
Priority		Status	unknown
Milestone		Resolved in
Superseder		Nosy List	falsifian
Assigned To		Topics

Created on 2023-05-18.16:40:33 by falsifian, last changed 2023-10-19.22:35:46 by falsifian.

Files
File name	Uploaded	Type	Edit	Remove
context	falsifian, 2023-05-28.22:18:42	text/plain
wip-resolve-issue2702.dpatch	bfrk, 2023-05-26.18:17:43	application/octet-stream
wip.dpatch	falsifian, 2023-05-28.22:18:44	text/plain

Messages
msg23268 (view)	Author: falsifian	Date: 2023-05-18.16:40:32
1. Summarise the issue (what were doing, what went wrong?) Run "darcs log". Quit the pager (less) without scrolling to the end. less quits but darcs doesn't exit. CTRL-C doesn't work. I could have sworn I reported this a few years ago, but I can't find the issue. http://bugs.darcs.net/issue1278 looks superficially similar, but I can't reproduce that one. 2. What behaviour were you expecting instead? darcs should quit once less does. 3. What darcs version are you using? (Try: darcs --exact-version) From OpenBSD ports; output of darcs --exact-version: darcs compiled on May 8 2023, at 07:59:54 Weak Hash: not available Context: [TAG 2.16.5 Ganesh Sittampalam <ganesh@earth.li>**20220220083531 Ignore-this: 51417cb5dd488dc0d962ffb051f63ee1 ] 4. What operating system are you running? OpenBSD-current on amd64
msg23269 (view)	Author: bfrk	Date: 2023-05-25.09:46:30
This seems to ba OS specific, it doesn't happen on Linux. The code is in src/Darcs/UI/External.hs, viewDocWith and getViewer, the actual process management is in pipeDocInternal where it says rval <- waitForProcess pid 1) What is your pager and with what arguments is it run? (use ps while the pager is still active). Depending on environment settings and availability, darcs will invoke different pagers. 2) How do you quit the pager exactly? 3) Can you still see the pager with ps after it has been "quit", i.e. is the process perhaps still running somehow? The fact that Ctrl-C doesn't work even after quitting the pager strongly suggests that the pager is still running, since we use the delegate_ctlc feature of System.Process, that is, the parent process (darcs) is ignoring SIGINT untill the child process is finished. This is so that pagers which have no way to quit them other via Ctrl-C can be quit without automatically quitting darcs itself, too.
msg23270 (view)	Author: falsifian	Date: 2023-05-25.21:30:38
On Thu, May 25, 2023 at 09:46:34AM +0000, Ben Franksen wrote: > > Ben Franksen <ben.franksen@online.de> added the comment: > > This seems to ba OS specific, it doesn't happen on Linux. > > The code is in src/Darcs/UI/External.hs, viewDocWith and getViewer, > the actual process management is in pipeDocInternal where it says > > rval <- waitForProcess pid > > 1) What is your pager and with what arguments is it run? (use ps > while the pager is still active). Depending on environment settings > and availability, darcs will invoke different pagers. less with no arguments. Here is an excerpt of pstree output before I (attempt to) quit less: \| \|-+= 88820 falsifia -ksh (ksh) \| \| \-+= 97789 falsifia darcs log \| \| \--- 35117 falsifia less > 2) How do you quit the pager exactly? I press "q". > 3) Can you still see the pager with ps after it has been "quit", > i.e. is the process perhaps still running somehow? It's a zombie. After pressing q: moth ~ $ ps -lp 35117 UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND 1000 35117 97789 0 -22 0 0 0 - Z+p pp 0:00.00 (less) Note this doesn't happen if I scroll to the end of the output before quitting less. Is the line of code you quoted above reached in that case? I could gather a trace of Darcs's system calls (with ktrace, like Linux strace) if that would help, e.g. to see what it's stuck on.
msg23271 (view)	Author: bfrk	Date: 2023-05-26.06:49:15
> It's a zombie. This comfirms my suspicion. > Note this doesn't happen if I scroll to the end of the output before > quitting less. Is the line of code you quoted above reached in that > case? We could find out: run the same darcs command with --debug and redirect stderr to a file, then post the last few lines of that file. The fact that it works fine if you scroll to the end suggests that the problem is related to the way we transfer the data to the pager. It looks to me as if darcs is stuck trying to send more bytes to the pipe, while the receiving end (the pager) has already terminated. So I think we don't even reach the line where the pipe is closed but are stuck at PipeToOther printers -> hPutDocWith printers i inp A process that terminates should close all its open handles. The question is: what happens at the other end, i.e. the parent process that tries to write to the pipe? > I could gather a trace of Darcs's system calls (with ktrace, like Linux > strace) if that would help, e.g. to see what it's stuck on. Before we go that far down, I suggest we exhaust the "printf debuggging" method. I suggest you clone http://darcs.net/screened, but it should also be possible to start from http://darcs.net/releases/branch-2.16 (which has the advantage that you don't have to rebuild all the dependencies). The idea is to add extra debugMessage calls in pipeDocInternal to narrow down where it hangs.
msg23272 (view)	Author: bfrk	Date: 2023-05-26.07:47:40
I just realized that what we do here is incorrect even on Linux. In src/Darcs/Util/SignalHandler.hs it says \| s == sigPIPE = exitWith $ ExitFailure 1 And indeed, when I run `darcs log` and then hit 'q' to terminate the pager, the exit code is 1, which is wrong. I think that, at least when we pipe our output through a pager, we should ignore SIGPIPE and handle (i.e catch and ignore) the EPIPE error when the write fails. Looking closer at the code in src/Darcs/UI/External.hs, I see that there is code to handle EPIPE in the function ortryrunning, which among other things also handles the ResourceVanished exception ... which (if I understand things correctly) should not be raised because we do have a handler for SIGPIPE. The signal handling code is very old and quite messy, so I may have overlooked something. And it doesn't explain why the behavior differs between Linux and OpenBSD.
msg23273 (view)	Author: bfrk	Date: 2023-05-26.18:13:37
From the mailing-list, for reference, by gpiero: I can reproduce this with a default fresh install of OpenBSD 7.3 using darcs-2.16.5p1 from packages. The issue, as confirmed by ktrace, is that the darcs process doesn't get a SIGPIPE. I'm not knowledgeable enough about OpenBSD, but I suspect this is a form of the old unflushed-buffered-output problem. This doesn't happen if you scroll all the way down in less or if you execute `darcs log` in a repo with a short enough log because in that case darcs has already emitted the whole output. OTOH, if darcs doesn't flush the pending output it doesn't receive a SIGPIPE and it just waits idle. How and when the buffered output is 'spontaneously' flushed, or if it's buffered at all, is OS-specific, so it explains why the issue only appears to be present in OpenBSD. Force-flushing the output (or maybe using an unbuffered output) should solve this and confirm my suspects. Unfortunately, I don't know how to do it in Haskell, but Ben/Ganesh surely do.
msg23274 (view)	Author: bfrk	Date: 2023-05-26.18:17:43
[My reply to the list:] Easy enough, see attached patch which I recorded against 2.16.5. Please clone http://darcs.net/releases/branch-2.16, cd into the new directory, apply the patch, and then run cabal update cabal run darcs -- log to see if the issue is indeed fixed or not. If yes, it would be nice if you could afterwards obliterate the patch and try again, to exclude the possibility that the problem was caused by an outdated dependency (such as a bug in the process packages that has since been fixed). [And from gpiero:] James, can you test Ben's patch? The OpenBSD instance I used for testing is a 512MB VM on my resource-starved laptop and there's no way I can recompile darcs over there. Attachments wip-resolve-issue2702.dpatch
msg23275 (view)	Author: bfrk	Date: 2023-05-27.18:02:55
I am pretty sure now that this patch will not fix the problem. We close the write-end of the pipe immediately after the write, which should do an implicit flush. You (gpiero) say > OTOH, if darcs doesn't flush the > pending output it doesn't receive a SIGPIPE and it just waits idle. I don't understand what it is that darcs is waiting for. Why is a signal needed at all? What is "the old unflushed-buffered-output problem"? In your ktrace, what is the last system call that darcs does (before it hangs)?
msg23276 (view)	Author: gpiero	Date: 2023-05-28.08:33:53
* [Sat, May 27, 2023 at 06:02:55PM +0000] Ben Franksen: >I am pretty sure now that this patch will not fix the problem. We close >the write-end of the pipe immediately after the write, which should do an >implicit flush. Yeah, I was sceptic too, but for another reason: I don't think the program reaches that part of code and the flush should have been placed more in depth. But, I've investigated a little more and darcs seems stuck inside hPutDocWith and, following the code, hPutBuf (or hPutBuf', I don't recall) does call flush. So it doesn't seem related to a lack of buffer flushing. Sorry I'm a bit vague here but I had to remove that VM and... well, I foolishly kept my notes INSIDE the VM (oops). >You (gpiero) say >> OTOH, if darcs doesn't flush the >> pending output it doesn't receive a SIGPIPE and it just waits idle. > >I don't understand what it is that darcs is waiting for. Why is a signal >needed at all? What is "the old unflushed-buffered-output problem"? I/O is usually buffered: ttys are typically line-buffered (it's flushed on newlines) and pipes are typically block-buffered (it's flushed when it's full). You can also flush the buffer explicitly or, implicitly, by closing the fd. For the sake of simplicity, here an example in Perl (it's not language-dependant, but some languages or libraries can call flush under the hood): $ perl -E 'print "foobar"; sleep 10;' This writes the string "foobar" to the buffer, sleeps for 10 seconds, then exits and in doing that closes the stdout and flushes the buffer. Only then you'll see "foobar" on the screen. Note that 'sleep 10' is just a stub: you can replace it with days long intensive computation, an infinite loop or a deadlock. In all cases, you'll only see "foobar" when the buffer is flushed to stdout. If you don't want to wait 10 seconds for that important piece of information, you can append a newline to the string (because it's line-buffered): $ perl -E 'say "foobar"; sleep 10;' or explicitly flush the buffer: $ perl -E 'print "foobar"; STDOUT->flush; sleep 10;' Well, in Perl you can also set autoflush, but that's not relevant here. >In your ktrace, what is the last system call that darcs does (before it >hangs)? Unfortunately, the ktrace.out was gone with the VM (oops 2). Will redo in the next days. Ciao, Gian Piero.
msg23280 (view)	Author: falsifian	Date: 2023-05-28.22:18:44
> I am pretty sure now that this patch will not fix the problem. We close > the write-end of the pipe immediately after the write, which should do an > implicit flush. For the record, it indeed doesn't. I wasn't able to build branch-2.16 because my base version is too high, but I applied what ought to be the same change on top of screened, and the problem is still there. For reference, I've attached my version of the patch (which I redid because yours depends on a tag not in screened) and the output of darcs log --context after recording that. Wit that change applied, and running "cabal run darcs -- log" (having earlier today run "cabal update"), I still see the behaviour I originally reported. > You (gpiero) say > > OTOH, if darcs doesn't flush the > > pending output it doesn't receive a SIGPIPE and it just waits idle. > > I don't understand what it is that darcs is waiting for. Why is a signal > needed at all? What is "the old unflushed-buffered-output problem"? > > In your ktrace, what is the last system call that darcs does (before it > hangs)? It's poll(). Here's an exerpt of a system call trace of just the darcs process (output of kdump -T) leading up to and including the SIGTERM I eventually sent it. 34354 darcs 1685311642.583002 CALL read(10,0x2edbb57d010,0x1000) 34354 darcs 1685311642.583012 GIO fd 10 read 1 bytes "\M^?" 34354 darcs 1685311642.583017 RET read 1 34354 darcs 1685311642.583032 CALL poll(0x2edba704a10,2,INFTIM) 34354 darcs 1685311655.373252 PSIG SIGTERM caught handler=0x2eaeb32ed50 mask=0<> 34354 darcs 1685311655.373285 RET futex -1 errno 4 Interrupted system call Attachments context wip.dpatch
msg23283 (view)	Author: falsifian	Date: 2023-05-28.22:55:23
In case it's helpful: As an experiment, I tried piping the output of the following simple program to less. I wrote it this way because after skimming the code not-very-carefully, I thought "darcs log" amounted to calling Data.ByteString.hPut a bunch of times (after forking off the pager). module Main where import qualified Data.ByteString as B import Data.String (fromString) import System.IO (stdout) main :: IO () main = B.hPut stdout (fromString "Hello.\n") >> main The behaviour is similar to darcs log, but not quite the same. I press "q" to exit less, and then: - The (above) program doesn't exit, like with darcs. - Unlike darcs log, the system call it's stuck on is select. With darcs log it was poll. So, maybe I'm not successfully reproducing what darcs log is doing? - Other differences probably explained by the fact that I did the pipe-to-less from the shell instead of having the Haskell program call less: (a) the less process is gone, not a zombie; (b) pressing ctrl-C (after having quit less) successfully terminates the program.
msg23286 (view)	Author: bfrk	Date: 2023-05-29.12:35:20
The select vs. poll is most probably due to a different version of ghc or perhaps different compiler flags. These calls are used by the GHC runtime system to multiplex IO between threads. It is interesting that you can reproduce the gist of the problem with that simple Haskell program and pipe composition does by the shell. This is a strong hint that what we have here is a bug in the GHC runtime on BSD. Before we open an issue on the GHC bug tracker, let's do a few more experiments: Does it make a difference whether you have "\n" in the string or not? Does the same happen if you replace less-then-quit with /bin/true? Also, which ghc version are you using? Can you test with the latest ghc (9.6.2)?
msg23287 (view)	Author: bfrk	Date: 2023-05-29.12:36:20
> with that simple Haskell program and pipe composition does by the > shell. This is a strong hint that what we have here is a bug in the s/does/done/
msg23288 (view)	Author: falsifian	Date: 2023-05-29.14:22:43
Thanks for the suggestions. > Does it make a difference whether you have "\n" in the string or > not? Only that there is a significant pause before less shows its ":" prompt on the bottom line of the terminal; I guess less is waiting for the input to reach some line length limit. After that, I press "q", and as usual my program doesn't exit. > Does the same happen if you replace less-then-quit with /bin/true? No, ./f \| /usr/bin/true exits immediately (./f is my compiled program). > Also, which ghc version are you using? Can you test with the latest > ghc (9.6.2)? I have 9.2.7. I'll try compiling the latest ghc, but it may be a while before I find the time.
msg23290 (view)	Author: gpiero	Date: 2023-05-29.17:16:04
For the record, the minimal test case is really minimal: $ cat <<EOF >test.hs module Main where main :: IO () main = putStrLn "foobar" >> main EOF $ ghc test $ ./test \| less OpenBSD 7.3 with ghc-9.2.7p1 from packages.
msg23291 (view)	Author: bfrk	Date: 2023-05-29.18:40:47
If it fails with this extremely simple test program and a recent enough ghc (9.2.8 is the version recommended by ghcup) then I think you should file an bug on the ghc tracker. Sorry for delegating this to you but I am not in a good position to respond to clarification requests etc. It is a bit unsettling that this does not happen with 'true', only with 'less'. Perhaps the error only manifests if the first write succeeds. You could try ./test \| read x The read here should terminate after reading one line. What I am trying to do here is to eliminate less as a possible culprit.
msg23292 (view)	Author: gpiero	Date: 2023-05-30.10:01:19
* [Mon, May 29, 2023 at 06:40:47PM +0000] Ben Franksen: >If it fails with this extremely simple test program and a recent >enough ghc (9.2.8 is the version recommended by ghcup) then I think >you should file an bug on the ghc tracker. James, is it something you can do? I'm not knowledgeable enough about OpenBSD nor GHC/Haskell and I'll have a hard time compiling code on that VM. For completeness, in case it's not already solved in recent releases, it's something to do with the feedback mechanism of a full buffer in pipes. It doesn't happen with /bin/true, read, etc. because they don't saturate the buffer. It happens with less, more and similar. To reproduce: $ cat <<EOF >test.hs module Main where main :: IO () main = putStrLn "foobar" >> main EOF $ ghc test $ ./test \| perl -E '$SIG{ALRM} = sub { die "exiting on SIGALRM\n" }; alarm 3; while (<>) { sleep .5; print $_ }' You could need to adjust the arguments to alarm() and sleep() in order to be sure the pipe is clogged on your system. Removing the sleep() is sufficient for not having the issue.
msg23308 (view)	Author: falsifian	Date: 2023-06-02.02:56:37
On Mon, May 29, 2023 at 06:40:47PM +0000, Ben Franksen wrote: ... > If it fails with this extremely simple test program and a recent > enough ghc (9.2.8 is the version recommended by ghcup) then I think > you should file an bug on the ghc tracker. Sorry for delegating this > to you but I am not in a good position to respond to clarification > requests etc. I am happy to help with this; it will just take a while before I can get to it. On Tue, May 30, 2023 at 10:01:19AM +0000, Gian Piero Carrubba wrote: > > Gian Piero Carrubba <gpiero@rm-rf.it> added the comment: > > * [Mon, May 29, 2023 at 06:40:47PM +0000] Ben Franksen: > >If it fails with this extremely simple test program and a recent > >enough ghc (9.2.8 is the version recommended by ghcup) then I think > >you should file an bug on the ghc tracker. > > James, is it something you can do? I'm not knowledgeable enough about > OpenBSD nor GHC/Haskell and I'll have a hard time compiling code on that > VM. Yes, I expect I'll be able to figure it out, unless recent changes in ghc have created more work for the OpenBSD port maintainer.
msg23586 (view)	Author: bfrk	Date: 2023-07-16.11:57:49
falsifian wrote: > I am happy to help with this; it will just take a while before > I can get to it. Just a reminder, in case you forgot ;-)
msg23629 (view)	Author: falsifian	Date: 2023-07-22.14:16:37
> Just a reminder, in case you forgot ;-) It's still on my list!
msg23724 (view)	Author: falsifian	Date: 2023-10-19.22:35:44
>It's still on my list! Small update: I tried naïvely updating the OpenBSD port to 9.8.1 but it looks like it will be some work (more than I have time for right now) because some patches no longer apply. I think I will wait until the next time someone updates the OpenBSD port.

History
Date	User	Action	Args
2023-05-18 16:40:33	falsifian	create
2023-05-25 09:46:34	bfrk	set	messages: + msg23269
2023-05-25 21:30:42	falsifian	set	messages: + msg23270
2023-05-26 06:49:17	bfrk	set	messages: + msg23271
2023-05-26 07:47:40	bfrk	set	messages: + msg23272
2023-05-26 18:13:37	bfrk	set	messages: + msg23273
2023-05-26 18:17:43	bfrk	set	files: + wip-resolve-issue2702.dpatch messages: + msg23274
2023-05-27 18:02:55	bfrk	set	messages: + msg23275
2023-05-28 08:33:53	gpiero	set	messages: + msg23276
2023-05-28 22:18:44	falsifian	set	files: + context, wip.dpatch messages: + msg23280
2023-05-28 22:55:23	falsifian	set	messages: + msg23283
2023-05-29 12:35:21	bfrk	set	messages: + msg23286
2023-05-29 12:36:20	bfrk	set	messages: + msg23287
2023-05-29 14:22:43	falsifian	set	messages: + msg23288
2023-05-29 17:16:05	gpiero	set	messages: + msg23290
2023-05-29 18:40:47	bfrk	set	messages: + msg23291
2023-05-30 10:01:19	gpiero	set	messages: + msg23292
2023-06-02 02:56:39	falsifian	set	messages: + msg23308
2023-07-16 11:57:49	bfrk	set	messages: + msg23586
2023-07-22 14:16:38	falsifian	set	messages: + msg23629
2023-10-19 22:35:46	falsifian	set	messages: + msg23724

Issue 2703 darcs log hangs after quitting pager

Attachments

Attachments