Diff for "ArchiveIndex"

Differences between revisions 7 and 8
Revision 7 as of 2010-11-02 15:34:22
Size: 8143
Editor: mpt
Comment: update to reflect the retirement of Canonical Partners
Revision 8 as of 2010-11-04 09:38:45
Size: 10910
Editor: mpt
Comment: updates from mvo
Line 1: Line 1:
## page was renamed from LEP/SoyuzArchiveIndex
## page was renamed from ArchiveIndex
Line 39: Line 37:
The relevant data can be extracted by inspecting the deb package after
the build (similar to what the pot file extraction is
doing). Alternatively, the processing can be done in batches by querying
which new deb packages have become available since the last run of the
extraction script. Please note that additional sources may be used that
are not inside the package itself (e.g. the popcon database or, in the
future, user-generated metadata).

The implementation should be flexible, as there is more information
that can be extracted. Initially we want the desktop file and the icon
associated with it. But information about the commands that the package
puts into /bin/ and /usr/bin for command-not-found is also
interesting.
Line 41: Line 53:
 * Opaque to apt, so doesn't really matter
 * RFC-822
 * Localization of categories and keywords (in the translations file?)

=== Soyuz process ===

 * Strip the data out of the package when building it, store it somewhere
 * Publish the metadata file along with everything else

=== Roadmap ===

 * Maybe start with non-localized data for PPAs, then localized in a future version?
 * Maybe still have Ubuntu Software Center using app-install-data instead for Main and Universe initially
  * iteratively survey the differences between app-install-data and the metadata Soyuz is producing
   * fix bugs in the packages and/or in Soyuz

actions:
 - client: write scripts to extract the needed data to LP
 - client: provide examples what the file

=== Issues ===

 * Often a .desktop file is in a separate package from the package you're actually interested in
  * e.g. wesnoth-data vs. wesnoth
  * e.g. emacs-common contains the icon for emacs22
  * maybe this should be fixed in the packages themselves

 * Debian may or may not be interested in this
  * e.g. keeping packages and debtags in sync

 * If bulk of metadata is not in Packages file then we can be nicer to Launchpad

 * Filling the Librarian with icons is necessary but annoying
  * garbage-collect them when done?

 * Current list view description translations should be migrated from app-install-data (currently ca. 4000 strings)
   [long description translations come from DDTP data in the archive, but list view's short descriptions come from .desktop file's Comment field which is translated in app-install-data's Rosetta template]

Two parts:
 1) LP export
 2) apt support for downloading this

LP:
 * what to export?
  * per arch, per pocket (main, universe)
  * one pkg -> multiple apps, app names are not unique
  * desktop data parts
   * appname
One file (potentially more in the future) should be generated
alongside the Packages.gz file to store the additional relevant
metadata from the desktop files. This file should be called "AppInfo"
for the C locale and "AppInfo-$lang" for the other locales.

Space and the number of files (because of rsync) are issues for mirrors.
The AppInfo/CommandNotFound data is small and textual, so it's no
problem to publish it in e.g. RFC-822 format alongside the Packages file
(the exact format does not matter to apt, which will just download it;
other apps like update-apt-xapian-index will parse it). To keep it
consistent with the rest of the data we should simply use RFC-822.

An example file might look like this:
{{{
Package: gnome-utils
Version: 2.23.1-0ubuntu1
Popcon: 17939
Section: main
Icon: baobab
Name: Disk Usage Analyzer
Comment: Check folder sizes and available disk space
Exec: baobab
Categories: GTK;GNOME;Utility;
}}}

and a localized one:
{{{
Package: gnome-utils
Version: 2.23.1-0ubuntu1
Popcon: 17939
Section: main
Icon: baobab
Name-de: Festplatten Überprüfer
Comment-de: Überprüfen des verfügbaren Platzes
Exec: baobab
Categories: GTK;GNOME;Utility;
}}}

Icons are relatively big (~5k/app; currently we have ~1800 icons=7Mb)
so it's not feasible to stuff them into a single file, especially if we
expect a lot of churn (like on extra.ubuntu.com, which will also use
this system). For this reason, they should be published as individual
files on e.g.
http://archive.ubuntu.com/ubuntu/dists/maverick/main/icons (or)
http://archive.ubuntu.com/ubuntu/dists/maverick/icons

This means that it's the job of the client to dynamically fetch the
icons from the local mirror and cache them. This is what is done on
e.g. Android as well.

=== Overrides ===

A package may want to override the AppInfo for itself instead of using
the desktop file. This should be supported on multiple levels.

It should be possible to blacklist a package via "XB-NoAppInfo: 1" in
its control file. This means that the package will not be scanned for
desktop files at all.

It should also be possible to override all of the appinfo by having a
debian/appinfo file in the source package that overrides the desktop
extraction entirely and forces the system to simply use this
file. This requires the extraction to look into the source package for
this file first (if that is tricky to implement we could move it into
the binary to a known path).

And finally, if the .desktop file already contains "X-AppInfo" fields,
the extractor should honor those and keep them. This is useful if
e.g. the package name determined during extraction would be wrong: if
a desktop file lives in "wesnoth-common" but the package we want is
"wesnoth", setting X-AppInfo-Package is a good way to override it.

=== Extraction ===

The current data extractor can be found in lp:~mvo/archive-crawler/mvo.
An approach similar to (but cleverer than) this script's may be taken to
gather the data.

On a Soyuz machine with a full mirror, the script runs every hour (or
two hours) and checks with the DB which new deb packages have become
available since it last ran. Those are fetched and inspected, the
results are written to a local sqlite database (or the LP database),
and the icons are extracted and stored as well. Because all the data
can be rebuilt by simply running the extraction again, it's probably
enough to have a local cache db. Then the script generates the AppInfo
and AppInfo-$lang files and populates the icon directory. In this step
it also needs to do orphan cleanup, i.e. removing packages that are no
longer in the archive. An rsync cron job is then required to copy the
generated files onto archive.ubuntu.com.

Icon extraction can be tricky as the icon may be stored in a different
package than the desktop file (e.g. emacs-common vs emacs23).

Stuff we want to extract:

 * desktop file:
   * appname (potentially multi language)
Line 90: Line 151:
   * pkgversion (to ensure we can validate our data is not stale)
Line 91: Line 153:
   * popcon / rating
   * keywords
   * popcon - probably not needed anymore once we have ratings
   * keywords (X-AppInstall-Keywords)
Line 96: Line 158:
  * command not found data
   * per arch, per pocket (main, universe)
   * codec-info

 * Icons
   * as individual files that get dynamically fetched

 * command not found data
Line 99: Line 165:
   * PROBLEM diverts etc, real world problem
  * icons ?
   * uuencode?
   * gtk-icon-cache ?
   * size?
   * format?
  * tagfile/rfc822 just like Packages
 * how many files?
  * command-not-found
  * software-center
  * icons

 * we need to hook into the build process to extract the desktop file, command not found file data, icons etc (problem is diverts)

=== Initial Implementation ===

 * Put icons into repository
   * options: big tarball or gtk-icon-cache
   
 * something needs to go once the build is finished and extract the .desktop file and icons, then export that metadata
   * setup another script once build is finished

 * need to create a standard for what kind of metadata can be supplied within the package
   * PROBLEM diverts etc, real world problem for e.g. vim

=== Roadmap ===

 * Start with non-localized data and only for PPAs
 * Expand to the full archive
   * iteratively survey the differences between app-install-data and the metadata Soyuz is producing
   * fix bugs in the packages and/or in Soyuz

=== Issues ===

 * A .desktop file is in a separate package from the package you're actually interested in
  * e.g. wesnoth-data vs. wesnoth
  * e.g. emacs-common contains the icon for emacs22
  * this should be fixed in the packages themselves with the override mechanism

 * An icon file is in a separate package from the desktop file
   * check how common that actually is

 * Keywords/comments should be available in Rosetta for translation, ideally as an additional template in the Ubuntu package translation page

 * one pkg may contain multiple apps, app names are not unique (e.g. Terminal is used multiple times)


=== Open issues ===

 * need to discuss and create a standard for what kind of metadata can be supplied within the package and finalize the override mechanism
Line 126: Line 196:
 * Things this metadata might contain:
   * icons
   * package descriptions (also translated)
   * restart required
   * not screenshots/movies/sounds (handled elsewhere as not downloaded up-front in software center)
   * mimetype
 * provide a way (LP/external site) to allow easy modifications/cleanup of metadata (improve descriptions, improve categories etc)

 * Things this metadata might contain in the future:
Line 134: Line 201:
     * similar problem to ratings updates
   * keywords and keyword translations

 * we have information that can be extracted from an upload (icon)
 * and changes that happen after an upload (rating)
 
=== Launchpad team requirements ===

The Launchpad team needs from the Ubuntu Software Center team:
 * File format for the metadata
 * TODO - software center team: The code to inspect Debian packages and extract the metadata

Archive index

As an Ubuntu packager
I want correct names, icons, and categories for applications to appear in Ubuntu Software Center automatically
so that I don’t need to remember to do it myself, or run the risk of messing it up.

As an application developer
I want to see the eventual application name, icon, category etc when the application is in my testing PPA
so that I can correct any errors before the application reaches one of the official repositories.

Ubuntu’s app-install-data-ubuntu and app-install-data-commercial packages, and for-purchase application metadata, should be replaced with a single automated system. Soyuz should produce for each archive it controls — including Multiverse, Canonical Partners, and every other PPA — an index of names, icons, summaries, categories, and keywords for all software items in the archive. (A “software item” in this sense mostly corresponds to a binary package, but in some cases one binary package contains multiple applications that should have separate information.) Launchpad should put this index in a standard place in the archive, and rebuild it whenever a package is added or changed.

Rationale

Since the beginning of Ubuntu’s Lucid cycle, we have wanted to get rid of app-install-data-ubuntu and app-install-data-commercial:

  • they are slow and difficult to update
  • the time we want to update app-install-data-ubuntu is when the archive is frozen

  • there are exceptions and bugs, so software shows up in Ubuntu Software Center that isn’t installable (or conversely, is hidden as a “technical item” when it isn’t)
  • they work only for Main, Universe, and Partner, not for PPAs, Multiverse, or third-party archives.

Ongoing cost of not doing this

  • Whenever a graphical application is added to Main or Universe, or its icon changes, Michael Vogt needs to rebuild the app-install-data-ubuntu package. Almost every bug in this package represents a cost of not generating this index automatically.

  • Whenever a for-purchase application is added to the Ubuntu Software Center store, Brian Thomason or Michael Vogt needs to manually add metadata for the package to the Software Center Agent. This can’t scale beyond a few dozen applications per week. (They also need to register the price of the application; that needs to be automated separately.)
  • Whenever a graphical application is added to (or updated in) Canonical Partners, Brian Thomason has needed to remember to rebuild the app-install-data-commercial package. Almost every bug in that package represents a cost of not generating that index automatically. (In Natty, Canonical Partners will be merged with the for-purchase archive, but that will just change one kind of manual work to another.)

  • Whenever an open-source application goes through the post-release process, the packager needs to add custom metadata fields to debian/control, metadata that duplicates existing fields in the application’s .desktop file. They shouldn’t have to do this.

Stakeholders

Matthew Paul Thomas, representing Michael Vogt and Brian Thomason

Implementation

The relevant data can be extracted by inspecting the deb package after the build (similar to what the pot file extraction is doing). Alternatively, the processing can be done in batches by querying which new deb packages have become available since the last run of the extraction script. Please note that additional sources may be used that are not inside the package itself (e.g. the popcon database or, in the future, user-generated metadata).

The implementation should be flexible, as there is more information that can be extracted. Initially we want the desktop file and the icon associated with it. But information about the commands that the package puts into /bin/ and /usr/bin for command-not-found is also interesting.
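
As a rough illustration of the command-not-found side, the sketch below derives a command -> packages index from per-package file lists. It assumes the file lists are already available as plain path strings (for example from inspecting each deb); the function names `commands_in_package` and `command_index` are hypothetical, and diverts and alternatives are deliberately not handled.

```python
import posixpath

# Directories named in the text above; anything else is ignored in this sketch.
COMMAND_DIRS = ("/bin", "/usr/bin")


def commands_in_package(paths):
    """Return the set of command names a package installs into COMMAND_DIRS."""
    commands = set()
    for path in paths:
        # File lists from a deb typically look like "./usr/bin/vim".
        if path.startswith("./"):
            path = path[1:]
        elif not path.startswith("/"):
            path = "/" + path
        directory, name = posixpath.split(path)
        # Directory entries such as "./usr/bin/" have an empty name and are skipped.
        if directory in COMMAND_DIRS and name:
            commands.add(name)
    return commands


def command_index(package_files):
    """Map each command name to the sorted list of packages that ship it."""
    index = {}
    for package, paths in package_files.items():
        for command in commands_in_package(paths):
            index.setdefault(command, []).append(package)
    return {command: sorted(packages) for command, packages in index.items()}
```

Because several packages can ship the same command name, the index maps each command to a list rather than to a single package.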

File format

One file (potentially more in the future) should be generated alongside the Packages.gz file to store the additional relevant metadata from the desktop files. This file should be called "AppInfo" for the C locale and "AppInfo-$lang" for the other locales.

Space and the number of files (because of rsync) are issues for mirrors. The AppInfo/CommandNotFound data is small and textual, so it's no problem to publish it in e.g. RFC-822 format alongside the Packages file (the exact format does not matter to apt, which will just download it; other apps like update-apt-xapian-index will parse it). To keep it consistent with the rest of the data we should simply use RFC-822.

An example file might look like this:

Package: gnome-utils
Version: 2.23.1-0ubuntu1
Popcon: 17939
Section: main
Icon: baobab
Name: Disk Usage Analyzer
Comment: Check folder sizes and available disk space
Exec: baobab
Categories: GTK;GNOME;Utility;

and a localized one:

Package: gnome-utils
Version: 2.23.1-0ubuntu1
Popcon: 17939
Section: main
Icon: baobab
Name-de: Festplatten Überprüfer 
Comment-de: Überprüfen des verfügbaren Platzes
Exec: baobab
Categories: GTK;GNOME;Utility;
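
A client that consumes these files could read them roughly as follows. This is a minimal sketch, assuming stanzas are separated by blank lines as in the Packages file and that localized fields carry a "-$lang" suffix as in the example above; the helper names `parse_appinfo`, `localized` and `categories` are hypothetical.

```python
def parse_appinfo(text):
    """Parse RFC-822-style stanzas (separated by blank lines) into dicts."""
    stanzas, current, last_key = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current stanza.
            if current:
                stanzas.append(current)
            current, last_key = {}, None
        elif line[0] in " \t" and last_key:
            # Continuation line: append to the previous field.
            current[last_key] += "\n" + line.strip()
        else:
            key, _, value = line.partition(":")
            last_key = key.strip()
            current[last_key] = value.strip()
    if current:
        stanzas.append(current)
    return stanzas


def localized(stanza, field, lang=None):
    """Return e.g. Name-de when present, falling back to the plain field."""
    if lang and f"{field}-{lang}" in stanza:
        return stanza[f"{field}-{lang}"]
    return stanza.get(field)


def categories(stanza):
    """Split the semicolon-separated Categories field, dropping empty items."""
    return [item for item in stanza.get("Categories", "").split(";") if item]
```

Falling back to the unlocalized field only helps when both fields are present in the data the client has loaded, for example after merging the "AppInfo" and "AppInfo-$lang" stanzas for the same package.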

Icons are relatively big (~5k/app; currently we have ~1800 icons=7Mb) so it's not feasible to stuff them into a single file, especially if we expect a lot of churn (like on extra.ubuntu.com, which will also use this system). For this reason, they should be published as individual files on e.g. http://archive.ubuntu.com/ubuntu/dists/maverick/main/icons (or) http://archive.ubuntu.com/ubuntu/dists/maverick/icons

This means that it's the job of the client to dynamically fetch the icons from the local mirror and cache them. This is what is done on e.g. Android as well.
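
The client-side fetch-and-cache step might look like the sketch below. The icon file name (including any extension) and the cache layout are assumptions, since the page does not fix them; `icon_url` and `cached_icon` are hypothetical helper names, and the download function is injectable so that the caching logic can be exercised without network access.

```python
import hashlib
import os
import urllib.request


def icon_url(mirror, dist, filename, component=None):
    """Build an icon URL under dists/<dist>/[<component>/]icons/ on a mirror."""
    parts = [mirror.rstrip("/"), "dists", dist]
    if component:
        parts.append(component)
    parts += ["icons", filename]
    return "/".join(parts)


def _download(url):
    """Default fetcher: plain HTTP GET returning the response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def cached_icon(url, cache_dir, fetch=_download):
    """Return a local path for the icon, downloading only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the URL so icons with the same name from different archives
    # do not overwrite each other in the cache.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        data = fetch(url)
        with open(path, "wb") as handle:
            handle.write(data)
    return path
```

This sketch never expires cache entries; a real client would also need some way to notice that an icon changed on the mirror.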

Overrides

A package may want to override the AppInfo for itself instead of using the desktop file. This should be supported on multiple levels.

It should be possible to blacklist a package via "XB-NoAppInfo: 1" in its control file. This means that the package will not be scanned for desktop files at all.

It should also be possible to override all of the appinfo by having a debian/appinfo file in the source package that overrides the desktop extraction entirely and forces the system to simply use this file. This requires the extraction to look into the source package for this file first (if that is tricky to implement we could move it into the binary to a known path).

And finally, if the .desktop file already contains "X-AppInfo" fields, the extractor should honor those and keep them. This is useful if e.g. the package name determined during extraction would be wrong: if a desktop file lives in "wesnoth-common" but the package we want is "wesnoth", setting X-AppInfo-Package is a good way to override it.
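
The three override levels could be applied in order roughly like this. It is only a sketch: the inputs are assumed to be already-parsed dicts, `resolve_appinfo` is a hypothetical name, and the check for a bare "NoAppInfo" key reflects the assumption that an XB- field from debian/control appears in the binary package's control data without its prefix.

```python
# Desktop keys copied into the AppInfo stanza in this sketch.
DESKTOP_KEYS = ("Name", "Comment", "Icon", "Exec", "Categories")
OVERRIDE_PREFIX = "X-AppInfo-"


def resolve_appinfo(package, control, desktop, source_appinfo=None):
    """Return the AppInfo dict for one desktop file, or None if blacklisted."""
    # Level 1: the package opted out entirely.
    if control.get("XB-NoAppInfo") == "1" or control.get("NoAppInfo") == "1":
        return None
    # Level 2: a debian/appinfo file replaces desktop extraction completely.
    if source_appinfo is not None:
        return dict(source_appinfo)
    # Level 3: extract from the desktop file, then let X-AppInfo-* fields win.
    info = {"Package": package}
    for key in DESKTOP_KEYS:
        if key in desktop:
            info[key] = desktop[key]
    for key, value in desktop.items():
        if key.startswith(OVERRIDE_PREFIX):
            info[key[len(OVERRIDE_PREFIX):]] = value
    return info
```

For the wesnoth case above, a desktop file shipped in "wesnoth-common" with "X-AppInfo-Package=wesnoth" would yield a stanza whose Package field is "wesnoth".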

Extraction

The current data extractor can be found in lp:~mvo/archive-crawler/mvo. An approach similar to (but cleverer than) this script's may be taken to gather the data.

On a Soyuz machine with a full mirror, the script runs every hour (or two hours) and checks with the DB which new deb packages have become available since it last ran. Those are fetched and inspected, the results are written to a local sqlite database (or the LP database), and the icons are extracted and stored as well. Because all the data can be rebuilt by simply running the extraction again, it's probably enough to have a local cache db. Then the script generates the AppInfo and AppInfo-$lang files and populates the icon directory. In this step it also needs to do orphan cleanup, i.e. removing packages that are no longer in the archive. An rsync cron job is then required to copy the generated files onto archive.ubuntu.com.
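
The local cache and the orphan cleanup step can be sketched with Python's sqlite3 module. The table layout and the function names here are assumptions made for illustration, not the schema used by the existing extractor.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the local cache; one row per application in a package."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS appinfo ("
        " package TEXT, version TEXT, name TEXT,"
        " PRIMARY KEY (package, name))"
    )
    return db


def store(db, package, version, name):
    """Insert or refresh the cached entry for one application."""
    db.execute(
        "INSERT OR REPLACE INTO appinfo VALUES (?, ?, ?)",
        (package, version, name),
    )


def remove_orphans(db, archive_packages):
    """Delete cached entries whose package is no longer in the archive."""
    cached = {row[0] for row in db.execute("SELECT DISTINCT package FROM appinfo")}
    orphans = cached - set(archive_packages)
    db.executemany(
        "DELETE FROM appinfo WHERE package = ?",
        [(package,) for package in orphans],
    )
    return sorted(orphans)
```

Keying rows on (package, name) allows one package to carry several applications, and since the cache can always be rebuilt from the archive, losing the database file costs only a full re-extraction.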

Icon extraction can be tricky as the icon may be stored in a different package than the desktop file (e.g. emacs-common vs emacs23).

Stuff we want to extract:

  • desktop file:
    • appname (potentially multi language)
    • packagename
    • pkgversion (to ensure we can validate our data is not stale)
    • Comment (friendly summary) - multi language
    • popcon - probably not needed anymore once we have ratings
    • keywords (X-AppInstall-Keywords)

    • iconname
    • Categories
    • mime-type
    • codec-info
  • Icons
    • as individual files that get dynamically fetched
  • command not found data
    • packagename -> binaries

    • PROBLEM diverts etc, real world problem for e.g. vim
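
For the desktop-file part of the list above, one possible extraction step is sketched below using Python's configparser. The mapping of "Name[de]" to "Name-de" follows the localized example earlier on this page; the function names and the choice of copied keys are assumptions, and mime-type, codec and keyword fields are left out for brevity.

```python
import configparser

# Keys copied from the [Desktop Entry] group in this sketch.
WANTED = ("Name", "Comment", "Icon", "Exec", "Categories")


def read_desktop_entry(text):
    """Return the [Desktop Entry] group of a .desktop file as a dict."""
    parser = configparser.ConfigParser(
        interpolation=None,       # Exec lines may contain % field codes
        delimiters=("=",),        # .desktop files only use '=' as separator
        comment_prefixes=("#",),
        strict=False,             # tolerate duplicate keys in sloppy files
    )
    parser.optionxform = str      # keep key case (Name, not name)
    parser.read_string(text)
    return dict(parser["Desktop Entry"])


def to_appinfo(package, version, entry):
    """Turn a parsed desktop entry into an AppInfo-style dict."""
    info = {"Package": package, "Version": version}
    for key, value in entry.items():
        base, bracket, rest = key.partition("[")
        if base not in WANTED:
            continue
        if bracket:
            # Localized key such as Name[de] becomes Name-de.
            info[f"{base}-{rest.rstrip(']')}"] = value
        else:
            info[base] = value
    return info
```

Recording the package version next to the extracted fields is what lets a later run detect stale data, as noted in the list above.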

Roadmap

  • Start with non-localized data and only for PPAs
  • Expand to the full archive
    • iteratively survey the differences between app-install-data and the metadata Soyuz is producing
    • fix bugs in the packages and/or in Soyuz

Issues

  • A .desktop file is in a separate package from the package you're actually interested in
    • e.g. wesnoth-data vs. wesnoth
    • e.g. emacs-common contains the icon for emacs22
    • this should be fixed in the packages themselves with the override mechanism
  • An icon file is in a separate package from the desktop file
    • check how common that actually is
  • Keywords/comments should be available in Rosetta for translation, ideally as an additional template in the Ubuntu package translation page
  • one pkg may contain multiple apps, app names are not unique (e.g. Terminal is used multiple times)

Open issues

  • need to discuss and create a standard for what kind of metadata can be supplied within the package and finalize the override mechanism
    • debian/control modifications
    • debian/something.desktop with X-App-Install tags
    • debian/something.something for command-not-found hints
  • provide a way (LP/external site) to allow easy modifications/cleanup of metadata (improve descriptions, improve categories etc)
  • Things this metadata might contain in the future:
    • hardware/software requirements (opengl etc)
    • whether it's available in my language (may change after package upload since translation packages are different)

Success

We will know we are done when the app-install-data-ubuntu and app-install-data-commercial packages are removed from the Ubuntu archive.

ArchiveIndex (last edited 2010-11-04 10:31:26 by mpt)