"Publishing" in Soyuz encompasses a range of activities done on a regular cycle:
- Creating publication records for accepted PackageUploads, as described in Soyuz/TechnicalDetails/UploadProcessor
- Putting package files in the archive repository
- Updating the package indexes in the archive repository
- Working out which package versions should be marked "superseded"
- Removing files from the archive repository that are no longer needed
There are three publishers: one for Ubuntu, one for PPAs, and a third for derivative distributions.
- Derivatives: cronscripts/publish-ftpmaster.py --all-derived
The Ubuntu publisher runs at the top of every hour. The PPA and derivatives publishers run every 5 minutes, or less often if the previous run takes longer than 5 minutes.
The PPA publisher is very straightforward: it is a wrapper script that calls process-accepted.py and then publish-distro.py.
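A minimal sketch of that wrapper, with illustrative script paths and flags (the real wrapper is a Launchpad script object, not a plain subprocess driver):

```python
import subprocess

def ppa_publisher_commands(distribution="ubuntu"):
    """The two steps the PPA wrapper runs, in order (paths illustrative)."""
    return [
        ["cronscripts/process-accepted.py", "--ppa", distribution],
        ["cronscripts/publish-distro.py", "--ppa", distribution],
    ]

def publish_ppas(distribution="ubuntu", run=subprocess.call):
    """Run each step in order; stop at the first non-zero exit status."""
    for command in ppa_publisher_commands(distribution):
        status = run(command)
        if status != 0:
            return status  # don't publish what wasn't accepted cleanly
    return 0
```

The ordering matters: process-accepted.py creates the publication records that publish-distro.py then acts on.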
The distro publishers are far more complicated. Here's a rough overview of the workflow.
Distribution publisher workflow
- If a new series was created since we last ran, do a full publication of its indexes.
- Make a backup of the "dists/" directory.
- Publish the pending packages into the pool and the backup dists tree.
- Run the custom scripts in publish-distro.d/
- As atomically as possible, rename the backup dists tree into the real one.
- Create ls-lR.gz listings.
- Clean out empty directories.
- Run the custom scripts in finalize.d/
All of this is what the cronscripts/publish-ftpmaster.py wrapper script does. It defers to other scripts (script objects, actually; we don't Popen() new processes) to do the meat of the work. Consider it the conductor for the publishing orchestra, if you like.
The vast bulk of the work happens in publish-distro.py. It does the intricate work of assembling a Debian-style repository from the metadata and files available in Launchpad's database. It goes through several phases of operation, with each method on the publisher object called in turn:
A_publish(): scans all the PENDING publications, fetches their files from the librarian, and puts them in pool/. It is very careful not to overwrite an existing file that has different contents; in fact, a content mismatch here indicates a serious problem earlier in the Soyuz pipeline.
B_dominate(): works out which packages are now superseded. If a newer version of a package exists than one currently published, the currently published one is marked superseded, which means it will eventually be removed from the archive.
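A toy sketch of domination, using plain tuples in place of Debian version comparison (the real code uses proper dpkg version ordering, which these tuples only approximate):

```python
def dominate(publications):
    """Mark all but the newest version of each package SUPERSEDED.

    publications: dict mapping package name to a list of
    (version_tuple, status) pairs. Returns the same shape with only
    the newest version left PUBLISHED. (Illustrative sketch only.)
    """
    result = {}
    for name, pubs in publications.items():
        newest = max(version for version, _ in pubs)
        result[name] = [
            (version, "PUBLISHED" if version == newest else "SUPERSEDED")
            for version, _ in pubs
        ]
    return result
```

Note that domination only *marks* publications; actual file removal is death row's job, described below.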
- Index writing, done in one of two ways:
  - C_doFTPArchive(): used for distributions; it calls an external apt-ftparchive process after building its input files.
  - C_writeIndexes(): used for PPAs; this internal code is simpler and quicker, but lacks some features that apt-ftparchive has.
- D_writeRelease(): writes Release files (which list all Sources and Packages files with their checksums); scripts in finalize.d/ later sign them to produce Release.gpg.
Death row processing
In a separate cron job, but still considered part of the publisher, we run scripts/process-death-row.py (every 30 minutes for PPAs, hourly for distros), which examines all the superseded sources and works out whether their files can be removed from the pool yet. There are many considerations, such as GPL conformance and whether the same source is still published elsewhere in the archive.
After a file is condemned, it gets a stay of execution of around a day, after which it is permanently removed from the archive.
Cleaning up the librarian
Removed files remain in the librarian for longer, but there is a script called cronscripts/expire-archive-files.py which currently only processes PPAs and removes files from the librarian 7 days after the package files are removed from the archive repository.
Distribution files in the librarian are currently only removed when a distribution series goes obsolete at its end of life. This is currently a manual task requiring SQL.