A Very Specific Prolog Warning
The Very Short Version:
If you’re going to open an input file stream in SWI-Prolog and then pass that stream to a foreign function that’s going to use the file descriptor, make sure you open it with bom(false), e.g. open(File, read, Stream, [bom(false)]), some_foreign_pred(Stream).
The Slightly Longer Version
How did I come across this?
As part of a little project I’m working on, I wanted the ability to create a process from SWI-Prolog & redirect its input & output to files that would be determined at runtime.
SWI-Prolog’s process_create/3 is based on SICStus’ API and so it lets you have the standard in/out/error be set to std (meaning the same as the parent process), null, or a pipe.
The pipe option seemed like it would be what I wanted, but it requires that you pass in an uninstantiated variable for the pipe.
What you have to do then is get the pipe, open up the file streams in the parent process, and use some threads to “pump” the data in and out of the pipes.
This seems like it should be doable enough, if kind of annoying, but I kept running into weird edge-cases in setting this up, so I decided to just have a go at extending process_create to allow passing in a file stream.
It would make my code much simpler, make it easier for other folks to use in the future, and theoretically be much more efficient as well.
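For comparison, here is a rough sketch of what the pipe-pumping approach described above might look like. The predicate names run_with_files/4 and pump_file_to_stream/2 are made up for illustration, and this is not the code from the project; it only assumes the documented pipe(Stream, Options) form of process_create/3 plus copy_stream_data/2.

```prolog
:- use_module(library(process)).

%% run_with_files(+Exe, +Args, +InFile, +OutFile)
%  Hypothetical helper: run Exe with stdin fed from InFile and stdout
%  captured into OutFile, using pipes and a pump thread.
run_with_files(Exe, Args, InFile, OutFile) :-
    process_create(Exe, Args,
                   [ stdin(pipe(ToChild, [type(binary)])),
                     stdout(pipe(FromChild, [type(binary)])),
                     process(Pid)
                   ]),
    % Feed the child's stdin from a separate thread, so a child that
    % writes output while still reading input cannot deadlock us.
    thread_create(pump_file_to_stream(InFile, ToChild), Pump, []),
    % Meanwhile, drain the child's stdout into the output file.
    call_cleanup(
        setup_call_cleanup(
            open(OutFile, write, Out, [type(binary)]),
            copy_stream_data(FromChild, Out),
            close(Out)),
        close(FromChild)),
    thread_join(Pump, _),
    process_wait(Pid, _Status).

%% pump_file_to_stream(+File, +ToChild)
%  Copy File into the pipe, then close the pipe so the child sees EOF.
pump_file_to_stream(File, ToChild) :-
    call_cleanup(
        setup_call_cleanup(
            open(File, read, In, [type(binary)]),
            copy_stream_data(In, ToChild),
            close(In)),
        close(ToChild)).
```

A query such as run_with_files(path(wc), ['-c'], '/tmp/foo', '/tmp/foo.count') would exercise it. Even this small version needs a thread, two cleanup handlers, and an explicit wait, which is the bookkeeping that passing a file stream directly avoids.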
I’ve previously added functionality to SWI-Prolog by doing some C stuff and it was a pretty straightforward & pleasant experience, so I dove right in. It was even easier than I thought it would be (my pull request ended up only having about 30 lines changed) and in my testing, it worked fine for redirecting standard output & standard error to files.
However, redirecting stdin to come from a file was having some confusing issues.
Initially, it seemed like it wasn’t able to read anything: I’d do something like the below & see wc outputting “0”.
?- open("/tmp/foo", read, S, []), process_create("/usr/bin/wc", ["-c"], [ stdin(stream(S)) ]).
However, if the input file was over a certain size, it would see some input, albeit only some of it – as if it were offset into the file…
At this point, I clearly needed more insight into what exactly the spawned process was getting, so I turned to the extremely-useful, if sometimes-baffling, lsof tool.
I tweaked the Prolog program to print out the PID of the process and changed the process that was running to take a little longer (essentially making it sleep 60; wc), so I’d be able to poke around while it was running.
Once it was started & I got the PID, I could then run lsof -p $PID and see that, indeed, while the input file descriptor was pointing to the file I was trying to point it to, it was offset, looking like it had been advanced one block in.
This would explain both why short files gave no output (because one block was the size of the entire file) and why longer files gave only partial output.
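The debugging setup described above could be sketched roughly as follows. The helper name spawn_delayed_wc/4 and its Delay argument are hypothetical, and the exact command is a guess based on the description; it also assumes a SWI-Prolog build in which process_create/3 accepts stdin(stream(S)).

```prolog
:- use_module(library(process)).

%% spawn_delayed_wc(+File, +Delay, -Pid, -In)
%  Hypothetical helper: start "sleep Delay; wc -c" with stdin attached to
%  File, and print the child's PID so it can be inspected from a shell.
spawn_delayed_wc(File, Delay, Pid, In) :-
    open(File, read, In, []),          % default options, so the BOM check runs
    format(atom(Cmd), 'sleep ~w; wc -c', [Delay]),
    process_create(path(sh), ['-c', Cmd],
                   [ stdin(stream(In)), process(Pid) ]),
    format("child PID: ~w~n", [Pid]).
```

After calling spawn_delayed_wc('/tmp/foo', 60, Pid, In), there is a minute to run lsof -p with the printed PID in another terminal; once done, process_wait(Pid, _) and close(In) tidy up.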
I now knew what was happening, but still needed to ascertain a solution.
After some experimentation with seeking & various options to open/4, I eventually figured out the solution at the top, setting bom to false.
What this does is tell Prolog not to look for a byte-order mark (BOM) at the beginning of the file, since checking for one means it starts consuming the file as soon as the stream is opened.
This is okay if the stream is only used inside Prolog, since Prolog buffers the input that was read, but the offset of the underlying file descriptor has already been advanced past that data, which means that passing it to foreign functions results in lost input.
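One way to see the difference for yourself is to let wc -c report how many bytes the child actually receives under different open/4 options. This is only a sketch: bytes_seen_by_child/3 is a made-up name, and it assumes a SWI-Prolog build whose process_create/3 supports stdin(stream(S)).

```prolog
:- use_module(library(process)).

%% bytes_seen_by_child(+File, +OpenOptions, -Count)
%  Hypothetical helper: open File with OpenOptions, hand the stream to
%  "wc -c" as its stdin, and return the byte count that wc reports.
bytes_seen_by_child(File, OpenOptions, Count) :-
    setup_call_cleanup(
        open(File, read, In, OpenOptions),
        ( process_create(path(wc), ['-c'],
                         [ stdin(stream(In)), stdout(pipe(Out)) ]),
          call_cleanup(read_string(Out, _, Text), close(Out))
        ),
        close(In)),
    % wc pads its output on some platforms, so strip whitespace first.
    split_string(Text, "", " \t\n", [Digits]),
    number_string(Count, Digits).
```

Comparing bytes_seen_by_child(File, [], N1) with bytes_seen_by_child(File, [bom(false)], N2) on the same file should show N2 matching the file size, while N1 comes up short in the way described above.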
I don’t imagine very many people will run into this exact issue, but it was weird enough that I thought it worth documenting, just in case someone else (or, more likely, myself in a few years) runs into a similar situation.