PyPI, the official Python Package Index, sometimes reaches its limits and times out. This can happen, and I'm sure the people behind it do their best to keep it up and running. But from a company perspective it's good to always have the files from PyPI available.
So we need a mirror, and a full one: it has to include packages that are not hosted on PyPI itself but on third-party servers. In the past, those links to externally hosted packages caused major problems. And yes, Murphy's law applies: the foreign server is down exactly when it is urgently needed.
Existing Software
So what do we have?
- pep381client (see PEP 381 'Mirroring infrastructure for PyPI'),
- z3c.pypimirror (see also its project page),
- collective.eggproxy, a caching proxy for eggs from egg servers,
- yopypi, a self-balancing instance that redirects your PyPI requests to a default (or predefined) PyPI mirror when PyPI is down.
Maybe there are more, but these are my most important findings.
pep381client sounds good and sounds official. It creates a more or less 1:1 mirror of PyPI. Good? Yes, that is what I expect from a mirror. But it does not mirror externally hosted packages, and that's exactly what we need for our use case.
z3c.pypimirror mirrors PyPI plus externally hosted packages and also follows externally hosted index pages. It supports incremental updates. Good? It's more than a mirror, because it aggregates packages, and yes, that is exactly our use case.
collective.eggproxy is a caching proxy, so it caches only the eggs that were requested. It's nice for speeding up local development, but not sufficient for production servers.
yopypi is a nice helper if you usually want to query the official PyPI but fall back to a mirror when PyPI has problems.
Setting up z3c.pypimirror
So we decided to set up z3c.pypimirror. After hitting some problems with externally hosted packages I contacted the authors and got write access on the Launchpad project to fix these problems. I released version 1.0.16 and everything described below works with this version.
First: I'm addicted to buildout, so here is the buildout I use to set up my mirror. This is the buildout.cfg:
    [buildout]
    parts =
        mirror
        mirror-cfg

    [mirror]
    recipe = zc.recipe.egg:scripts
    eggs = z3c.pypimirror

    [dirs]
    recipe = z3c.recipe.mkdir
    mirror-base = PATH/TO/mirror
    mirror-files = ${:mirror-base}/files
    paths = ${:mirror-files}

    [mirror-cfg]
    recipe = collective.recipe.template
    input = ${buildout:directory}/pypimirror.cfg.in
    output = ${buildout:directory}/pypimirror.cfg
    url = http://pypi.MYDOMAIN.TLD
    mirror-path = ${dirs:mirror-files}
    lockfile = ${buildout:directory}/mirror.lock
    logfile = ${dirs:mirror-base}/mirror.log
And the configuration template pypimirror.cfg.in:
    [DEFAULT]
    # the root folder of all mirrored packages.
    # if necessary it will be created for you
    mirror_file_path = ${:mirror-path}

    # where's your mirror on the net?
    base_url = ${:url}

    # lock file to avoid duplicate runs of the mirror script
    lock_file_name = ${:lockfile}

    # days to fetch in past on update
    fetch_since_days = 1

    # Pattern for package files, only those matching will be mirrored
    filename_matches =
        *.zip
        *.tgz
        *.egg
        *.tar.gz
        *.tar.bz2

    # Pattern for package names; only packages having matching names will
    # be mirrored
    package_matches =
        *

    # remove packages not on pypi (or externals) anymore
    cleanup = True

    # create index.html files
    create_indexes = True

    # be more verbose
    verbose = True

    # resolve download_url links on pypi which point to files and download
    # the files from there (if they match filename_matches).
    # The filename and filesize (from the download header) are used
    # to find out if the file is already on the mirror. Not all servers
    # support the content-length header, so be prepared to download
    # a lot of data on each mirror update.
    # This is highly experimental and shouldn't be used right now.
    #
    # NOTE: This option should only be set to True if package_matches is not
    # set to '*' - otherwise you will mirror a huge amount of data. BE CAREFUL
    # using this option!!!
    external_links = True

    # similar to 'external_links' but also follows an index page if no
    # download links are available on the referenced download_url page
    # of a given package.
    #
    # NOTE: This option should only be set to True if package_matches is not
    # set to '*' - otherwise you will mirror a huge amount of data. BE CAREFUL
    # using this option!!!
    follow_external_index_pages = False

    # logfile
    log_filename = ${:logfile}
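To get a feeling for what the filename_matches setting lets through, here is a minimal Python sketch. It assumes the patterns are ordinary shell-style globs, which the *.zip syntax suggests; that z3c.pypimirror evaluates them this way internally is an assumption, and the function name is_mirrored is made up for illustration.

```python
from fnmatch import fnmatchcase

# The patterns from filename_matches in pypimirror.cfg.in above.
FILENAME_MATCHES = "*.zip *.tgz *.egg *.tar.gz *.tar.bz2".split()


def is_mirrored(filename, patterns=FILENAME_MATCHES):
    """Return True if `filename` matches at least one glob pattern.

    Hypothetical helper: it only illustrates glob matching and is not
    taken from z3c.pypimirror.
    """
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


if __name__ == "__main__":
    for name in ("example-1.0.tar.gz", "example-1.0-py2.6.egg", "example-1.0.exe"):
        print(name, is_mirrored(name))
```

Under this assumption, source archives and eggs are picked up, while anything else, such as a Windows .exe installer, would be skipped unless you add a pattern for it.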
Add a bootstrap.py to the directory and run:

    python2.6 bootstrap.py
    ./bin/buildout

Now take some time, bandwidth and ~16 GB of hard disk space and run the initial mirror. If you're on a remote server over ssh, run it in the background or, like I do, in a screen. If the process stops for some reason, just re-run it; it won't download packages twice.

    ./bin/pypimirror -I -c -v pypimirror.cfg
Finally, add a cron job to fetch the updates, e.g. every two hours, like so:

    0 0-23/2 * * * /PATH/TO/pypimirror/bin/pypimirror -U /PATH/TO/pypimirror/pypimirror.cfg
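An update can take longer than the interval between two cron runs, which is why the configuration has a lock_file_name setting to avoid duplicate runs. The following Python sketch shows the general idea of such a lock file. It is not z3c.pypimirror's actual implementation; the names exclusive_lock and AlreadyRunning are invented for this example.

```python
import os
from contextlib import contextmanager


class AlreadyRunning(Exception):
    """Raised when the lock file already exists (hypothetical name)."""


@contextmanager
def exclusive_lock(path):
    """Hold a lock file at `path` for the duration of the with-block."""
    try:
        # O_EXCL makes the call fail if the file already exists, so only
        # one process can create the lock.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise AlreadyRunning(path) from None
    try:
        # Store our PID so an admin can see who holds the lock.
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))
        yield path
    finally:
        # Remove the lock when the block ends, even after an error.
        os.remove(path)


if __name__ == "__main__":
    try:
        with exclusive_lock("mirror.lock"):
            print("lock acquired, the update would run here")
    except AlreadyRunning:
        print("another run is still active, skipping")
```

A lock of this kind stays behind if the process is killed hard, so in that case it has to be removed by hand before the next run can start.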
Now a webserver is needed. I took nginx and added a site to /etc/nginx/sites-enabled:
    server {
        listen IPADDRESS;
        server_name pypi.MYDOMAIN.TLD;
        location / {
            root /PATH/TO/mirror/files;
        }
    }
Reload nginx and done! The mirror is ready.
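A quick way to check that the web server really delivers the mirror is to request the page of one package. The Python sketch below assumes that the mirror exposes one directory per package directly below its root, which is the layout buildout and setuptools expect from an index; the helper names are made up, and pypi.MYDOMAIN.TLD is the placeholder from the configuration above.

```python
import urllib.error
import urllib.request


def package_page_url(index_url, package):
    """Build the URL of a package page below the index root."""
    return "{}/{}/".format(index_url.rstrip("/"), package)


def mirror_serves(index_url, package, timeout=10.0):
    """Return True if the package page answers with HTTP 200."""
    url = package_page_url(index_url, package)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts and HTTP errors.
        return False


if __name__ == "__main__":
    # Replace the placeholder host with your own mirror before running.
    print(mirror_serves("http://pypi.MYDOMAIN.TLD", "z3c.pypimirror"))
```

If this prints False for a package you know was mirrored, check the root path in the nginx site and the permissions on the files directory first.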
Using the mirror
I use buildout almost everywhere. To use the mirror, simply add one line to your buildout's main section:
    [buildout]
    ....
    index = http://pypi.MYDOMAIN.TLD
    ...
For more information consult the official docs.
Image: Green Tree Python by nasmac (Ian C), under a CC license, edited by Jens Klein.