Created on 2014-04-15.20:47:32 by gh, last changed 2023-04-01.12:55:26 by bfrk.
msg17352 (view) |
Author: gh |
Date: 2014-04-15.20:47:30 |
|
Packs aim at making repo cloning via HTTP faster. To create packs, the
user must run "darcs optimize --http", which creates packs corresponding
to the current state of the repository.
When packs get outdated (because of new patches), "darcs get" gets the
packs anyway, and applies the missing patches. The problem is that
outdated packs make cloning *slower* than cloning without packs, since
patch application can be costly.
So I suggest a little change of format and behaviour:
* when creating packs, copy pristine hash to _darcs/packs/pristine
* when getting, compare remote _darcs/packs/pristine to the pristine
hash of _darcs/hashed_inventory
* if _darcs/packs/pristine does not exist, or hash is different, get
normally, otherwise get with packs (function copyPackedRepository2)
Basically, this makes darcs clone a repository with packs only when they
are up to date (modulo pristine hash collisions, which can happen, mostly
if the missing patches are tags, since tags do not change the pristine
state).
As a bonus, this is backward compatible with darcs 2.8, but anyway packs
were not enabled by default, so I guess we can change them as we wish.
Related:
* <http://darcs.net/Internals/OptimizeHTTP>
* <http://irclog.perlgeek.de/darcs/2014-04-15#i_8592088>
|
msg17353 (view) |
Author: gh |
Date: 2014-04-15.20:49:02 |
|
Sorry for the lack of proofreading; here's a correction:
Outdated packs do *not* make cloning *systematically* slower, but they
can with time.
|
msg17354 (view) |
Author: kowey |
Date: 2014-04-16.09:00:09 |
|
Wasn't the idea behind packs supposed to be that we would fetch from
both sides and meet in the middle?
--
Eric Kow <http://erickow.com>
|
msg17356 (view) |
Author: gh |
Date: 2014-04-17.18:00:08 |
|
Yes that was the idea, but in the case of getting the last pristine
state, it does not work well in all cases, since outdated packs require
downloading and applying extra patches, which unfortunately is slow in
some real-world cases.
One toy case I made for the sake of the argument is this repo:
<http://www.cs.famaf.unc.edu.ar/~hoffmann/badpacks/> It has 2 patches,
one that introduces a big binary file, and another that replaces its
contents with only a few bytes. Cloning it without packs is much faster
than with.
And I can't think of any way of predicting whether it's worth using
packs+new patches versus pristine downloading.
Now for getting the whole history... actually yes, the "meeting in the
middle" idea works, since we just want to download all patches. So for
patches I think we should use the pack in all cases.
That is, my proposal is now:
* when creating packs, copy pristine hash to _darcs/packs/pristine
* when getting, compare remote _darcs/packs/pristine to the pristine
hash of _darcs/hashed_inventory
* if _darcs/packs/pristine does not exist, or hash is different, get
the pristine cache normally, otherwise get it with packs (beginning of
function copyPackedRepository2)
* if _darcs/packs/patches.tar.gz exists, grab this pack and patches in
parallel (end of function copyPackedRepository2)
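The decision in the list above can be sketched as follows. This is only
an illustration in Python, not darcs code: the names
pristine_hash_of_inventory and plan_clone are hypothetical, and the sketch
assumes that _darcs/hashed_inventory records the pristine hash on a line
of the form "pristine:<hash>".

```python
from typing import Optional


def pristine_hash_of_inventory(inventory_text: str) -> Optional[str]:
    """Return the pristine hash recorded in a hashed_inventory file.

    Assumption: the hash sits on a line starting with "pristine:".
    """
    for line in inventory_text.splitlines():
        if line.startswith("pristine:"):
            return line[len("pristine:"):].strip()
    return None


def plan_clone(inventory_text: str,
               packed_pristine: Optional[str],
               patches_pack_exists: bool) -> dict:
    """Decide which packs to use for a clone.

    packed_pristine is the content of the remote _darcs/packs/pristine,
    or None if that file does not exist.
    """
    current = pristine_hash_of_inventory(inventory_text)
    basic_pack_current = (
        packed_pristine is not None
        and current is not None
        and packed_pristine.strip() == current
    )
    return {
        # Pristine: use the basic pack only when it matches the current state.
        "pristine_from_pack": basic_pack_current,
        # Patches: use the patches pack whenever it exists, alongside
        # ordinary patch downloads.
        "patches_from_pack": patches_pack_exists,
    }
```

With an outdated basic pack, plan_clone reports that the pristine files
should be fetched normally while the patches pack is still used.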
|
msg17429 (view) |
Author: noreply |
Date: 2014-05-04.20:17:25 |
|
The following patch sent by Guillaume Hoffmann <guillaumh@gmail.com> updated issue2379 with
status=resolved;resolvedin=2.10.0 HEAD
* resolve issue2379: only use packs to copy pristine when up-to-date
Ignore-this: 76acb197a8a681ef92c496819b08add5
When creating packs, save pristine hash to _darcs/packs/pristine
If basic pack is outdated, do not fetch it, but fetch patches pack
anyway.
In Darcs.Repository, separate functions between the ones that fetch
basic repository and complete repository (packed or not), and
separate function that clones old-fashioned repositories.
|
msg23243 (view) |
Author: bfrk |
Date: 2023-04-01.12:55:25 |
|
Re-opening for discussion.
> When packs get outdated (because of new patches), "darcs get" gets the
> packs anyway, and applies the missing patches. The problem is that [...]
> patch application can be costly.
Yes, patch application is costly, but there is no reason to do that. We simply
download the missing pristine files after getting the basic pack. (In case you
have trouble seeing how and where this is done: it is happening as a side-effect
of createPristineDirectoryTree.)
It is still true that there are cases where this is slower than only downloading
the current pristine files. The test case mentioned by gh (I can only go by its
description; unfortunately the repo is no longer online) has two special
properties: (1) the basic pack is much larger than the sum of current pristine
file sizes; and (2) the number of (current) pristine files is small.
On the other hand, it is easy to see that there are cases where using an outdated
basic pack is much cheaper, namely when there are many pristine files, the pack
is only slightly outdated, and the latest (unpacked) patches only touch a small
percentage of existing files.
So this is a trade-off and we have to decide which case is more likely to occur
in practice. And the answer to that is quite obviously that the latter case is
much more typical than the former.
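To make the trade-off concrete, here is a toy cost model in Python. It is
a sketch under strong assumptions, not a measurement of darcs: cost is
taken to be a fixed overhead per HTTP request plus bytes divided by
bandwidth, and all function names, default parameters and example numbers
are invented for illustration.

```python
def transfer_cost(requests: int, total_bytes: int,
                  latency_s: float = 0.1,
                  bandwidth_bps: float = 1_000_000.0) -> float:
    """Estimated seconds: per-request overhead plus raw transfer time."""
    return requests * latency_s + total_bytes / bandwidth_bps


def cost_without_pack(n_files: int, pristine_bytes: int) -> float:
    """Download every current pristine file individually."""
    return transfer_cost(n_files, pristine_bytes)


def cost_with_outdated_pack(pack_bytes: int, n_changed: int,
                            changed_bytes: int) -> float:
    """Download the basic pack in one request, then the files it lacks."""
    return transfer_cost(1, pack_bytes) + transfer_cost(n_changed, changed_bytes)


# Many files, slightly outdated pack: the pack wins in this model.
typical_pack = cost_with_outdated_pack(20_000_000, 100, 1_000_000)
typical_plain = cost_without_pack(5000, 50_000_000)

# One tiny current file, huge outdated pack: the pack loses.
toy_pack = cost_with_outdated_pack(100_000_000, 1, 10)
toy_plain = cost_without_pack(1, 10)
```

In this model the per-request overhead dominates when there are many
pristine files, which is why the slightly outdated pack comes out ahead
in the first case and behind in the second.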
Projects tend to grow over time, adding more and more files (documentation, test
cases, features). The number of files to download for pristine is the main reason
why (even a lazy) unpacked clone can become unbearably slow unless you have most
of the files already in your global cache. While it may happen occasionally that
you remove a very large file or make it (much) smaller, the vast majority of
patches make small to medium sized changes that leave the contents of most files
untouched.
Here is a real world example for which I have made a repo available via HTTP as
http://darcs.net/test. This is screened (at the time of writing this) with packs
outdated by 121 patches. The times are the best out of three successive runs to
account for server side caching. The command was:
> time darcs clone http://darcs.net/test -v --no-cache --lazy
With current darcs (use basic pack only if current): 2:54 minutes
With a patched darcs (always use basic pack if available): 0:42 minutes
The improvement factor is roughly 4; it gets better the less outdated the packs
are, e.g. with only 30 patches I get something on the order of 20 seconds for the
patched darcs, but of course this depends on the number of files touched by the
patches not included in the pack. The optimum (packs are current) is about 8
seconds for both variants. If you run `darcs optimize http` regularly from a cron
job, then packs will always be at least "almost up to date" and you get a huge
improvement from using basic packs.
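For reference, the improvement factor quoted above follows directly from
the two timings. A small Python check (the helper name to_seconds is just
for illustration):

```python
def to_seconds(mm_ss: str) -> int:
    """Convert a "minutes:seconds" string to a number of seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


current = to_seconds("2:54")   # current darcs: 174 seconds
patched = to_seconds("0:42")   # patched darcs: 42 seconds
speedup = current / patched
print(f"{speedup:.2f}")        # prints 4.14
```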
I won't be able to send the patches that make this change for some time because
they depend on a number of unrelated improvements.
|
|
Date | User | Action | Args
2014-04-15 20:47:32 | gh | create |
2014-04-15 20:49:03 | gh | set | messages: + msg17353
2014-04-15 21:08:06 | gh | set | title: only clone repositories with packs when they are up-to-date -> only use with packs when up-to-date
2014-04-15 21:14:55 | gh | set | title: only use with packs when up-to-date -> only use packs when up-to-date
2014-04-16 09:00:11 | kowey | set | messages: + msg17354; title: only use packs when up-to-date -> only clone repositories with packs when they are up-to-date
2014-04-16 13:35:57 | gh | set | nosy: + darcs-devel
2014-04-16 13:36:18 | gh | set | nosy: + simon
2014-04-17 18:00:10 | gh | set | messages: + msg17356
2014-05-04 20:17:26 | noreply | set | status: unknown -> resolved; messages: + msg17429; resolvedin: 2.10.0
2023-04-01 12:55:26 | bfrk | set | status: resolved -> has-patch; messages: + msg23243