CCP14 Homepage - Single Crystal and Powder Diffraction - Auto-Mirrored Web/FTP Sites- WGET software for FTP and Web Auto-mirroring


Collaborative Computational Project Number 14

for Single Crystal and Powder Diffraction

CCP14

WGET software for FTP and Web Auto-mirroring

The CCP14 Homepage is at http://www.ccp14.ac.uk

Where to get WGET
WGET is freeware/Gnu UNIX based software written by Hrvoje Niksic (E-mail: hniksic@srce.hr). It is used for manual and automatic mirroring of FTP and Web sites. Various versions are available via the following links:

WGET - Web site created by john@futuresguide.com and space provided by Karsten Thygesen

Contact: karthy@kom.auc.dk
Web SITE
Automirroring Software by Hrvoje Niksic - UNIX and reference to Windows
Original at http://sunsite.auc.dk/wget/
[CCP14 UK Web Mirror] | [Canadian CCP14 Mirror] | [US CCP14 Mirror]

WGET - Automirroring Software by Hrvoje Niksic

Contact: hniksic@srce.hr
FTP SITE
UNIX
Original at ftp://gnjilux.cc.fer.hr/pub/unix/util/wget/
[CCP14 UK Web Mirror] | [CCP14 UK FTP Mirror] | [Canadian CCP14 Mirror] | [US CCP14 Mirror] | [Australian CCP14 Mirror]

WGET 1.5 beta for MS-Windows - compiled by Tim Charron

Contact: tcharron@interlog.com
WEB SITE
MS-Windows
Original at http://www.interlog.com/~tcharron/wgetwin.html

WGET for Windows

Contact: Heiko.Herold@previnet.it
WEB SITE
MS-Windows
Original at http://space.tin.it/computer/hherold/

wGetGUI Windows GUI for wget

Contact: wGetGUI@JensRoesner
WEB SITE
MS-Windows
Original at http://www.jensroesner.de/wgetgui/

WackGet Windows GUI version of WGET

WEB SITE
MS-Windows
Original at http://millweed.com/projects/wackget
Wackget screenshot at http://millweed.com/cgi-bin/pic?shots/wackget.gif
Wackget updates at http://millweed.com/cgi-viewcvs/viewcvs.cgi/*checkout*/WackGet/CHANGES.txt

Using WGET from behind a firewall

http://www.geocities.com/CapeCanaveral/Lab/9991/wget.html

GUI WGET - using a Tcl/Tk Script ("Utter Coolness")
From Paul Rahme (Mr Bogus), South Africa. E-mail: paulrahmel@hotmail.com

wgetshel.tcl (Download a free Tcl/Tk interpreter via: http://sunscript.sun.com/TclTkCore/)
(CCP14 Tcl/Tk mirror at: http://programming.ccp14.ac.uk/ftp-mirror/programming/tcltk/pub/tcl/)

WGET for Grabbing files off VMS FTP Site

Patch and version by:

Jan Prikryl - prikryl@cg.tuwien.ac.at 
         http://www.cg.tuwien.ac.at/staff/JanPrikryl.html
         Institute of Computer Graphics and Visualisation
             Vienna University of Technology, Austria

wget-1.5.4-b4-vms.tar.Z (~720kB) (obsolete - latest WGET can handle VMS servers)
wget-1.5.4-b4-vms.tar.gz (~470kB) (obsolete - latest WGET can handle VMS servers)

Possible Alternatives to WGET - PAVUK, FWGET and LFTP

PAVUK - FTP and HTTP Mirroring
- Can do deletion of obsolete files as well as other features. A possible alternative to WGET.
- Refer to the comparison of wget vs pavuk on the WGET mailing list
- At http://www.idata.sk/~ondrej/pavuk/about.html

FWget - FTP and HTTP Mirroring
- Personal update of WGET. A possible alternative to WGET.
- Refer to http://bay4.de/FWget/
- "FWget is based on Wget, but I have replaced some of the linked lists with hashtables. Wget has got serious problems retrieving huge sites, FWget hasn't. That's about all about it. Version 1.5.3.1 contains some fixes I have not reviewed."

LFTP
- "LFTP is sophisticated ftp/http client, file transfer program. Like BASH, it has job control and uses readline library for input. It has bookmarks, built-in mirror, can transfer several files in parallel. It was designed with reliability in mind"
- At http://lftp.yar.ru/
- Download LFTP: http://lftp.yar.ru/get.html
- LFTP Binaries: ftp://ftp.thewrittenword.com/packages/free/
- RPMs for i386 and also an experimental cygwin binary: http://ftp.yars.free.net/lftp/binaries/
- Read-only CVS access is available. You can use

Optimising and Compiling WGET

By default, WGET converts the ~ (tilde) into its escaped numeric equivalent (%7E), which can cause a bit of havoc when auto-mirroring user areas. To have it leave the ~ as is, do the following (thanks to Hrvoje for explaining how to do this):

For the obsolescent WGET 1.5.3

Extract the files from the gz file with the command

gzip -d < wget-version.tar.gz | tar xvof -
i.e.,  gzip -d < wget-1.5.3.tar.gz | tar xvof -

Before compiling a new version of WGET, edit the url.c file

Remove the ~ from the non-Windows URL_UNSAFE definition, i.e., change:

#ifndef WINDOWS
# define URL_UNSAFE " <>\"#%{}|\\^~[]`@:\033"
#else
# define URL_UNSAFE " <>\"%{}|\\^[]`\033"
#endif

To:

#ifndef WINDOWS
# define URL_UNSAFE " <>\"#%{}|\\^[]`@:\033"
#else
# define URL_UNSAFE " <>\"%{}|\\^[]`\033"
#endif
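
The 1.5.3 edit above can also be applied without opening an editor. The following is a minimal sketch, assuming the only ~ on the URL_UNSAFE lines is the one to be removed (as in the excerpt above). The function name strip_tilde_from_unsafe is hypothetical and not part of WGET.

```sh
# Read C source on stdin and remove the first "~" found on any line that
# defines URL_UNSAFE; all other lines pass through unchanged.
strip_tilde_from_unsafe() {
    sed '/define URL_UNSAFE/s/~//'
}

# Usage (keeps the untouched file as url.c.orig):
#   cp src/url.c src/url.c.orig
#   strip_tilde_from_unsafe < src/url.c.orig > src/url.c
```

Comparing url.c.orig with url.c using diff before compiling is a cheap way to confirm that only the intended line changed.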

For the latest WGET 1.7.1 and later

Extract the files from the gz file with the command

gzip -d < wget-version.tar.gz | tar xvof -
i.e.,  gzip -d < wget-1.8.1.tar.gz | tar xvof -

Before compiling a new version of WGET, edit the url.c file

Replace the U that corresponds to the ~ with a 0 (a zero), i.e., change the following from:

const static unsigned char urlchr_table[256] =
{
  U,  U,  U,  U,   U,  U,  U,  U,   /* NUL SOH STX ETX  EOT ENQ ACK BEL */
  U,  U,  U,  U,   U,  U,  U,  U,   /* BS  HT  LF  VT   FF  CR  SO  SI  */
  U,  U,  U,  U,   U,  U,  U,  U,   /* DLE DC1 DC2 DC3  DC4 NAK SYN ETB */
  U,  U,  U,  U,   U,  U,  U,  U,   /* CAN EM  SUB ESC  FS  GS  RS  US  */
  U,  0,  U,  U,   0,  U,  R,  0,   /* SP  !   "   #    $   %   &   '   */
  0,  0,  0,  R,   0,  0,  0,  R,   /* (   )   *   +    ,   -   .   /   */
  0,  0,  0,  0,   0,  0,  0,  0,   /* 0   1   2   3    4   5   6   7   */
  0,  0,  U,  R,   U,  R,  U,  R,   /* 8   9   :   ;    <   =   >   ?   */
 RU,  0,  0,  0,   0,  0,  0,  0,   /* @   A   B   C    D   E   F   G   */
  0,  0,  0,  0,   0,  0,  0,  0,   /* H   I   J   K    L   M   N   O   */
  0,  0,  0,  0,   0,  0,  0,  0,   /* P   Q   R   S    T   U   V   W   */
  0,  0,  0,  U,   U,  U,  U,  0,   /* X   Y   Z   [    \   ]   ^   _   */
  U,  0,  0,  0,   0,  0,  0,  0,   /* `   a   b   c    d   e   f   g   */
  0,  0,  0,  0,   0,  0,  0,  0,   /* h   i   j   k    l   m   n   o   */
  0,  0,  0,  0,   0,  0,  0,  0,   /* p   q   r   s    t   u   v   w   */
  0,  0,  0,  U,   U,  U,  U,  U,   /* x   y   z   {    |   }   ~   DEL */

To:

const static unsigned char urlchr_table[256] =
{
  U,  U,  U,  U,   U,  U,  U,  U,   /* NUL SOH STX ETX  EOT ENQ ACK BEL */
  U,  U,  U,  U,   U,  U,  U,  U,   /* BS  HT  LF  VT   FF  CR  SO  SI  */
  U,  U,  U,  U,   U,  U,  U,  U,   /* DLE DC1 DC2 DC3  DC4 NAK SYN ETB */
  U,  U,  U,  U,   U,  U,  U,  U,   /* CAN EM  SUB ESC  FS  GS  RS  US  */
  U,  0,  U,  U,   0,  U,  R,  0,   /* SP  !   "   #    $   %   &   '   */
  0,  0,  0,  R,   0,  0,  0,  R,   /* (   )   *   +    ,   -   .   /   */
  0,  0,  0,  0,   0,  0,  0,  0,   /* 0   1   2   3    4   5   6   7   */
  0,  0,  U,  R,   U,  R,  U,  R,   /* 8   9   :   ;    <   =   >   ?   */
 RU,  0,  0,  0,   0,  0,  0,  0,   /* @   A   B   C    D   E   F   G   */
  0,  0,  0,  0,   0,  0,  0,  0,   /* H   I   J   K    L   M   N   O   */
  0,  0,  0,  0,   0,  0,  0,  0,   /* P   Q   R   S    T   U   V   W   */
  0,  0,  0,  U,   U,  U,  U,  0,   /* X   Y   Z   [    \   ]   ^   _   */
  U,  0,  0,  0,   0,  0,  0,  0,   /* `   a   b   c    d   e   f   g   */
  0,  0,  0,  0,   0,  0,  0,  0,   /* h   i   j   k    l   m   n   o   */
  0,  0,  0,  0,   0,  0,  0,  0,   /* p   q   r   s    t   u   v   w   */
  0,  0,  0,  U,   U,  U,  0,  U,   /* x   y   z   {    |   }   ~   DEL */

NOTE: On SGI IRIX 6.5.x, it can be dangerous to compile WGET with gcc, as strange things can happen due to bugs in the gcc/IRIX combination (e.g., it can start mirroring the destination server on top of itself).
For SGI IRIX, configure the build for installation in your home area by running the following in a Bash shell, which forces the compilation to use cc (not gcc):
CC=cc ./configure
Then compile wget (the executable is created in the src directory) by typing make
Check that the ~ mod works by starting a grab on a user site (e.g., wget http://www.chem.gla.ac.uk/~louis/). Check that it has created the user directory as ~louis.
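
The check described above can be scripted. This is a minimal sketch, assuming the test grab was written under a known local directory. The function name check_tilde_dir is hypothetical.

```sh
# Report how a mirrored user directory was named on disk.
# $1 = directory the grab was written into, $2 = user name (without "~").
check_tilde_dir() {
    root=$1
    user=$2
    if [ -d "$root/~$user" ]; then
        echo "ok: ~$user"
    elif [ -d "$root/%7E$user" ] || [ -d "$root/%7e$user" ]; then
        echo "escaped: %7E$user"
    else
        echo "missing: $user"
    fi
}

# Usage, after a test grab into the current directory with -nH:
#   check_tilde_dir . louis
```

An "escaped" result would suggest that the url.c change was not compiled in, or that a different wget binary was picked up from the PATH.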

The .wgetrc Config File

After compiling WGET, you need to install a .wgetrc file in your home area. The WGET manual goes into detail about the options and how to set up the .wgetrc file. The main things to note are the settings that vary from the default:

follow_ftp = off
recursive = on
dirstruct = on
robots = on

(To disable robots.txt via the command line, add --execute robots=off)
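
As a minimal sketch, the four settings listed above can be written to a file from the shell. The function name write_wgetrc is hypothetical, and the sketch writes only these four lines, so it is not a substitute for the full CCP14 .wgetrc.

```sh
# Write the non-default settings listed above to the file named in $1.
# Point it at a scratch file first so that an existing ~/.wgetrc is not
# overwritten by accident.
write_wgetrc() {
    cat > "$1" <<'EOF'
follow_ftp = off
recursive = on
dirstruct = on
robots = on
EOF
}

# Usage:
#   write_wgetrc "$HOME/.wgetrc.new"
#   mv "$HOME/.wgetrc.new" "$HOME/.wgetrc"
```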

Click here to view the simple CCP14 .wgetrc file.

WGET Manual Files

(from wget-1.5.3 - September 1998)

Windows Help File format

(In Zip format - from wget-1.5 - April 25th 1998)

http://www.sunsite.auc.dk/wget/wgethelp.zip | [CCP14 Mirror]

Setting Up Script Files and Cron Job for Automirroring

Automirroring is still under Construction. An alpha version of a WGET that is VMS friendly is being used to automatically mirror VMS FTP servers. (affects GSAS, Larry Finger FTP Site and Profil).

Automirroring runs daily (or less frequently if requested by the remote webmaster) using cron, between ~1am and ~5am UK time. Depending on the site, some customisation can be required, either to link the web area to the ftp site, or to overcome "non-recommended" html in the files that WGET is not happy with. The reason for these times is that grabbing files over the US link (from the UK) incurs a charge (~2 pence per megabyte), but there is a free period between 1am and 5am. Thus automirroring is rigged to occur between these hours, though even if done outside these hours, the amount of material coming over for incremental mirroring is quite trivial. Just before 5am, all wget jobs are killed and the automatically emailed summary log is quickly examined to see which jobs did not complete. The job order can then be rearranged so that they do complete within the required time.

The cron file contains the list of script files that are run each morning. As recommended by the local cron guru, a .crontab file is kept in the home directory and can be freely edited. To make cron use the updated file, run the command: crontab .crontab. It is possible to inadvertently destroy a crontab (e.g., crontab -r means remove, not read!). The default UNIX line editor can also cause some mayhem for the unwary when crontab -e is run.

#start these auto-mirroring jobs up
01 01 * * * cp web_area/mirrorbin/logs/report-proforma.txt web_area/mirrorbin/logs/report.txt
31 01 * * * web_area/mirrorbin/wget.script.alife
32 01 * * * web_area/mirrorbin/wget.script.armel
33 01 * * * web_area/mirrorbin/wget.script.asia
34 01 * * * web_area/mirrorbin/wget.script.australia
35 01 * * * web_area/mirrorbin/wget.script.cguitools
36 01 * * * web_area/mirrorbin/wget.script.cguitools.wxwindows
37 01 * * * web_area/mirrorbin/wget.script.commercial
38 01 * * * web_area/mirrorbin/wget.script.europe.one
39 01 * * * web_area/mirrorbin/wget.script.europe.two
10 01 * * * web_area/mirrorbin/wget.script.generalcode
11 01 * * * web_area/mirrorbin/wget.script.gnu
12 01 * * * web_area/mirrorbin/wget.script.gnu-site
13 01 * * * web_area/mirrorbin/wget.script.gnu.mumit-khan
14 01 * * * web_area/mirrorbin/wget.script.netlib
15 01 * * * web_area/mirrorbin/wget.script.northamerica.one
16 01 * * * web_area/mirrorbin/wget.script.northamerica.two
17 01 * * * web_area/mirrorbin/wget.script.southamerica
18 01 * * * web_area/mirrorbin/wget.script.swarm
19 01 * * * web_area/mirrorbin/wget.script.tcltk
20 01 * * * web_area/mirrorbin/wget.script.uk
21 01 * * * web_area/mirrorbin/wget.script.utility
22 01 * * 0 web_area/mirrorbin/wget.script.weekly
23 01 * * * web_area/mirrorbin/wget.script.wget
05 05 * * * /usr/sbin/Mail -s "Mirroring_Results `date`" ccp14@dl.ac.uk < web_area/mirrorbin/logs/report.txt

Cron invokes the commands from the home directory (~/). Thus everything must be relative to the home directory.
The columns stand for when the command is run:
- First column is the minute
- Second column for the hour
- Third column for the day of the month (* means run on each day of the month)
- Fourth column for the month of the year (* means run on each month of the year)
- Fifth column for the day of the week (0 is a Sunday, * means run on each day of the week)
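
Since a damaged crontab is easy to produce, a quick sanity check before installing the file can help. This is a minimal sketch that only counts fields (five time columns plus a command); it does not validate the values. The function name check_crontab_fields is hypothetical, and the sketch assumes the file contains no VAR=value lines, which it would also flag.

```sh
# Print every line of the crontab file $1 that is not blank, not a comment,
# and has fewer than six whitespace-separated fields. Returns non-zero if
# any such line was found.
check_crontab_fields() {
    awk '
        /^[ \t]*(#|$)/ { next }                     # skip comments and blanks
        NF < 6         { print NR ": " $0; bad = 1 }
        END            { exit bad + 0 }
    ' "$1"
}

# Usage:
#   crontab -l > .crontab.bak          # keep a copy of the live table
#   check_crontab_fields .crontab && crontab .crontab
```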

To mirror both the FTP and Web area of a site, the following type of system is defined in an ASCII Script file. In this case, for grabbing the ORTEP site.

To internally link the html files to point correctly to the local ftp area, the rather kludgy cshell scripts are run after WGET.

After each WGET grab, its completion time is logged to the report.txt file which, at the end of all the wget sessions, is mailed to the operator so that it is easy to see that all jobs ran.

#!/bin/csh
# You should CHANGE THE NEXT 3 LINES to suit your local setup
setenv  LOGDIR   ./web_area/mirrorbin/logs    # directory for storing logs
setenv  PROGDIR  ./web_area/mirrorbin         # location of executable
setenv  PUTDIR   ./web_area/web_live/ccp      # relative directory for mirroring

#FTP Ortep 3 FTP Site
#  E-mail: ortep@ornl.gov (Dr. Michael N. Burnett)
$PROGDIR/wget -nH -r -N -nr -l0 -k -np -X /cgi-bin --cache=off \
 ftp://ftp.ornl.gov/pub/ortep/ \
 -P $PUTDIR/ccp14/ftp-mirror/ornl-ortep \
 -o $LOGDIR/northamerica.one.log 

set DATE=(`date`)
sed "/ORNL_Ortep_FTP/s/NOT_finished/COMPLETED $DATE/" $LOGDIR/report.txt  > $LOGDIR/report.txt.new
mv $LOGDIR/report.txt.new $LOGDIR/report.txt


#WEB Ortep 3 Web Site
#  E-mail: ortep@ornl.gov (Dr. Michael N. Burnett)
$PROGDIR/wget -nH -r -N -nr -l0 -np -X /cgi-bin --cache=off \
 http://www.ornl.gov/ortep/ortep.html \
 http://www.ornl.gov/ortep/topology.html \
 -P $PUTDIR/web-mirrors/ornl-ortep \
 -o $LOGDIR/northamerica.one.log 


foreach f ($PUTDIR/web-mirrors/ornl-ortep/ortep/*.html)
sed -e 's+=\"http://www.ornl.gov/ortep+=\".p+g' < $f > $f.tmp
mv $f.tmp $f
end

foreach f ($PUTDIR/web-mirrors/ornl-ortep/ortep/*.html)
sed -e 's+\"ftp://ftp.ornl.gov+\"./../../../ccp14/ftp-mirror/ornl-ortep+g' < $f > $f.tmp
mv $f.tmp $f
end

foreach f ($PUTDIR/web-mirrors/ornl-ortep/ortep/topology/*.html)
sed -e 's+\"ftp://ftp.ornl.gov+\"./../../../../ccp14/ftp-mirror/ornl-ortep+g' < $f > $f.tmp
mv $f.tmp $f
end

foreach f ($PUTDIR/web-mirrors/ornl-ortep/ortep/examples/*.html)
sed -e 's+\"ftp://ftp.ornl.gov+\"./../../../../ccp14/ftp-mirror/ornl-ortep+g' < $f > $f.tmp
mv $f.tmp $f
end

set DATE=(`date`)
sed "/ORNL_Ortep_Web/s/NOT_finished/COMPLETED $DATE/" $LOGDIR/report.txt  > $LOGDIR/report.txt.new
mv $LOGDIR/report.txt.new $LOGDIR/report.txt
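
The repeated foreach/sed loops in the script above follow one pattern, which can be sketched as a single POSIX sh function (the CCP14 scripts themselves are csh). The function name rewrite_links is hypothetical. As in the original, the "from" text is used as a sed pattern, so a dot matches any character, and neither argument may contain a + character.

```sh
# In every *.html file directly inside directory $1, replace links that
# start with "$2 (opening quote included) by links that start with "$3.
rewrite_links() {
    dir=$1
    from=$2
    to=$3
    for f in "$dir"/*.html; do
        [ -f "$f" ] || continue              # the glob matched nothing
        sed -e "s+\"$from+\"$to+g" < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}

# Usage, mirroring the second loop of the script above:
#   rewrite_links "$PUTDIR/web-mirrors/ornl-ortep/ortep" \
#       'ftp://ftp.ornl.gov' './../../../ccp14/ftp-mirror/ornl-ortep'
```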

Defining what each WGET option does

(Credit goes to the IUCr Technical mirroring page at http://www.iucr.org/iucr-top/docs/mirror/)

-nh - (This option has been deleted in wget 1.8.x) Do not perform DNS lookup on the host name (speeds things up)
However, where a webpage uses absolute host names that really refer to the same computer (e.g., wserv1.dl.ac.uk and www.dl.ac.uk), do not use this option, so that WGET can check whether these pages belong to the same computer and grab them. DNS lookup is left enabled for the Sirware site so that, in theory, www.ba.cnr.it and area.ba.cnr.it are treated as identical webservers.
```
#Sirware WebSite
#  E-mail: cryst@area.ba.cnr.it
$PROGDIR/wget -nH -r -N -nr -l0 -k -np -X /cgi-bin --cache=off  \
 http://www.ba.cnr.it/IRMEC/SirWare_main.html \
 http://www.ba.cnr.it/IRMEC/ \
 -P $PUTDIR/web-mirrors/sirware \
 -o $LOGDIR/europe.log
```
-nH - Disable generation of host name prefixes (stop the creation of host subdirectory names - i.e., ./pub/ortep instead of ./ftp.ornl.gov/pub/ortep/)
-r - recurse down the subdirectory tree
-R - reject files (e.g., -R .listing)
-N - turn on time stamping to help enable incremental updates of files
-nr - retain .listing files for FTP (enables incremental updates of ftp areas)
-l0 - recursively fetch files to infinite depth (-l3 would tell wget to recursively go 3 levels deep)
-np - "no parents" - only recurse down the directory tree - never fetch a parent of the root directory.
-X - reject subdirectories with this name.
-A - accept files with this text (i.e., -Aexe grabs all *.exe files, but nothing else)
-I - accept only the listed subdirectories (this is an alternative to using -np "no parents"; it is used where icons or images are kept in a completely different part of the web server).
-k - try to convert absolute links into "working" relative links (not 100% effective in the present version of WGET).
-K - (WGET 1.7 and later) `--backup-converted': when converting a file, back up the original version with a `.orig' suffix. Affects the behavior of `-N' (*note HTTP Time-Stamping Internals::).
-P - tell WGET where to put the downloaded files. (Using relative directory paths can be helpful, as WGET can do strange things when given absolute directory paths.)
-o - tell WGET where to write its screen output. If you don't specify this, the cron job will send you this output in an email message (NB: using the append command does not work with the latest version of WGET, as the output is now written to stderr and so goes via email if invoked via cron)
-D - only grab files from the defined domains. This is required when, for some strange reason, -nh does not work. This is the case for the www.netlib.org site, where WGET also starts grabbing the university (.edu) web-site that hosts this area. In this case, use -Dnetlib.org, as it hosts other org bodies.
--cache=off - Will get around problems of caches reporting cached versions. (When set to off, disable server-side cache. In this case, Wget will send the remote server an appropriate directive (`Pragma: no-cache') to get the file from the remote service, rather than returning the cached version. This is especially useful for retrieving and flushing out-of-date documents on proxy servers. Caching is allowed by default.)

-d (--debug) - debug mode - for when you have to send in a bug report

#Netlib - Code Library
#  E-mail: ehg@research.bell-labs.com (Eric Grosse)
$PROGDIR/wget -nh -Dnetlib.org -nH -r -N -nr -l0 -k -np  --cache=off \
   -X /cgi-bin,/~icl,/utk  \
 http://www.netlib.org/ \
 -P $PUTDIR/web-mirrors/programming/netlib \
 -o $LOGDIR/programming.log
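
To illustrate the intent of -Dnetlib.org in the example above, here is a simplified sketch of suffix-style domain matching. It is only an illustration: how WGET itself compares hosts against the -D list may differ in detail. The function name in_domain and the host www.example.edu are hypothetical.

```sh
# Succeed when host $1 is the domain $2 itself or a host inside it,
# i.e. when $1 equals $2 or ends in ".$2".
in_domain() {
    case $1 in
        "$2" | *."$2") return 0 ;;
        *)             return 1 ;;
    esac
}

# Usage:
#   in_domain www.netlib.org  netlib.org && echo "would be fetched"
#   in_domain www.example.edu netlib.org || echo "would be skipped"
```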

Efficient Web/FTP Mirroring - using WGET

Efficient HTTP Based Web Mirroring - using WGET

Generally, HTTP based mirroring is "NOT" very efficient for incremental updates, as it requires that every web file is "pinged" to check whether it has changed. However (presently), if you have to do some processing of these files, or use their internal links to get off-domain files, it is quite effective. On small sites, HTTP based mirroring is easy to do and not a problem, but on large sites, it can take most of the day to do a daily or weekly incremental mirror.

Efficient FTP Based Web/FTP Mirroring - using WGET

FTP mirroring can be far more efficient for mirroring sites as to get an incremental update of a web or ftp site, a directory listing is all it requires to check whether there have been updated files. (compared to http mirroring where it has to ping every file to do this check)

While FTP based mirroring can be an order of magnitude or more efficient than HTTP based mirroring, one downside of ftp mirroring is that WGET cannot presently interrogate the html files grabbed by FTP to convert any absolute links to relative links, etc. This method also assumes all the files are available by grabbing the FTP tree. For web areas not accessible via public/anonymous FTP, just use password protected access.

#Armel Le Bail's WEB Site - via password protected FTP
#  E-mail: armel@fluo.univ-lemans.fr (Armel LeBail)
$PROGDIR/wget -r -nH --cut-dirs=4  -N -nr -l0 -np  --cache=off \
 ftp://username:password@199.199.199.199//home/armel/doobry/webdocs \
  -P $PUTDIR/web-mirrors/armel \
  -o $LOGDIR/armel.log

Notice how --cut-dirs=4 is used to make the 4 level deep directory that houses the web pages look like the root area of the mirror at ../ccp/web-mirrors/armel/. Otherwise the mirrored pages would end up at ../ccp/web-mirrors/armel/home/armel/doobry/webdocs
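
The effect of --cut-dirs on the local path can be illustrated with a small string function. This is a sketch of the resulting path only, not of WGET's own code, and it assumes -nH is also in use so that no host directory is created. The function name cut_dirs and the file index.html are hypothetical.

```sh
# Print remote path $1 with its first $2 directory components removed.
# The file name itself is never removed.
cut_dirs() {
    path=${1#/}                      # drop the leading slash
    n=$2
    while [ "$n" -gt 0 ]; do
        case $path in
            */*) path=${path#*/} ;;  # strip one leading directory
            *)   break ;;            # only the file name is left
        esac
        n=$((n - 1))
    done
    printf '%s\n' "$path"
}

# Usage:
#   cut_dirs /home/armel/doobry/webdocs/index.html 4
#   cut_dirs /home/armel/doobry/webdocs/index.html 0
```

With a count of 4 the page lands directly under the -P directory. With 0 the full home/armel/doobry/webdocs tree is recreated there.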

It is also possible to create an account that can only perform read-only ftp, and for which the web area to be mirrored looks like a root ftp area, as though it were an anonymous ftp area but with password access restriction. On an SGI IRIX system, this can initially be a fiddly job to set up (preferably with access to a local UNIX/network admin guru), but man ftpd explains how to set this up in quite some detail.

Create a new user (e.g., ftpmirror) whose home area is the root area of the web-files (e.g., /web_disk/ccp14/web_area)
Create a /ftpusers file with the appropriate restrictions
```
ftpmirror restrict
```
(refer to man ftpd for how this may work on your variant of UNIX). This means that when logged in as ftpmirror, the system will do a chroot.
Set various permissions (which I am going to have to get the info on)
Create an appropriate FTP set of subdirectories with ownership of root and group owner of sys. Place the required functionality within these directories, such as ls so that directory listings can be done. In the subdirectories, . is owned by root with group sys; .. is owned (as an example) by ccp14 with group dlccp14a. (Again, refer to man ftpd for how this may work on your variant of UNIX.)
- Make sure to read the "man ftpd" as this may vary with versions of the operating system.
As people may try to browse these ftp config files (i.e., http://sv1.ccp14.ac.uk/bin/), redirect them to an appropriate page via the web server redirect procedure. In the case of the Apache web server, something like the following in the /usr/local/etc/httpd/conf/httpd.conf file:
```
RewriteEngine on
RewriteRule ^/bin/(.*) http://www.ccp14.ac.uk/bad-link.html [R]
RewriteRule ^/dev/(.*) http://www.ccp14.ac.uk/bad-link.html [R]
RewriteRule ^/etc/(.*) http://www.ccp14.ac.uk/bad-link.html [R]
RewriteRule ^/lib/(.*) http://www.ccp14.ac.uk/bad-link.html [L,R]
RewriteRule ^/lib32/(.*) http://www.ccp14.ac.uk/bad-link.html [L,R]
```
This may not be the 100% security kosher way of doing this, but is also cited on the main apache newsgroups as a nifty trick to do.

ULTRA Efficient netlib style Co-operative Mirroring - NOT Presently in WGET

The basis of this system is that the server being mirrored co-operates in providing information on what files have been modified over time.

Refer to the following published article on this.

From ehg@research.bell-labs.com  Thu Jun 11 23:41:38 1998
Date: Thu, 11 Jun 1998 18:30:54 -0400
To: L.M.D.Cranswick@dl.ac.uk
From: "Eric Grosse" <ehg@research.bell-labs.com>
Subject: re: More Detailed Spec on your Mirroring Method?

There is the paper
 @Article{Grosse:1995:RM,
   author =       "Eric Grosse",
   title =        "Repository Mirroring",
   journal =      "ACM Trans. Math. Software",
   volume =       "21",
   number =       "1",
   pages =        "89--97",
   month =        mar,
   year =         "1995",
   CODEN =        "ACMSCU",
   ISSN =         "0098-3500",
   bibdate =      "Tue Apr 25 15:42:31 1995",
   URL =          "ftp://netlib.bell-labs.com/netlib/crc/mirror.ps.gz",
   keywords =     "C.2.4 [Computer-Communication Networks]: Distributed
                   Systems -- distributed databases",
   subject =      "archives; checksum; distributed administration;
                   electronic distribution; ftp"
 }

The basic idea is to publish (as an ordinary file on the web/ftp site)
a simple listing of the entire collection as
   pathname unixtime bytes checksum
and then provide utilities to compare two such files, generating
commands to bring the "slave" into sync with the "master."  The
customary shell or perl tools can be used to filter these files
for complicated master/slave relationships, or you can use it
straight, for straightforward mirroring.

If I were redoing this today, I might use persistent HTTP in place of
ftp, and might merge netlib's MD5 files into these "crc" files.  But
that would be pushing farther forward than some people want to go
just yet;  ftp and crc are still the right conservative choice.

You're welcome to repost this to the wget mailing list if you like.

Eric
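
The comparison step described in the message above can be sketched with standard shell tools. This is only an illustration of the idea, not the netlib utilities themselves. It assumes each listing line has the four fields pathname, unixtime, bytes and checksum, and that pathnames contain no spaces. The function name sync_plan and the get/update/delete verbs are hypothetical.

```sh
# Compare a master listing ($1) with a slave listing ($2) and print one
# action per line that would bring the slave into sync with the master.
sync_plan() {
    awk '
        FILENAME == ARGV[1] {                    # first file: slave listing
            have[$1] = $2 " " $3 " " $4
            next
        }
        {                                        # second file: master listing
            seen[$1] = 1
            if (!($1 in have))
                print "get " $1
            else if (have[$1] != ($2 " " $3 " " $4))
                print "update " $1
        }
        END {
            for (p in have)
                if (!(p in seen)) print "delete " p
        }
    ' "$2" "$1" | sort
}

# Usage:
#   sync_plan master.crc slave.crc
```

Each output line could then be turned into the matching fetch or remove command for the mirror in question.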

rsync style Co-operative Mirroring - NOT Presently in WGET

While not present in WGET yet, another option for high efficiency co-operative mirroring is the rsync software. The server you wish to mirror must have an rsync server running. These are not difficult to install and configure.

Rsync homepage - http://rsync.samba.org
Compiling and Installing Rsync

Logging Details of Nightly Incremental Updates

To keep control of the incremental nightly mirroring, the script files log the time when each mirroring job has finished into a single file, which is then emailed each morning to the operator. This makes any problems that occurred the night before quite obvious.

A report-proforma.txt text file is kept up to date with sites being mirrored. At the start of the cron job, a new report.txt file is created.

01 01 * * * cp web_area/mirrorbin/logs/report-proforma.txt web_area/mirrorbin/logs/report.txt

The commands run after each incremental WGET grab are of the following type:

#ANSTO LHPM Rietveld FTP Area
#  E-mail: bah@ansto.gov.au (Brett Hunter)
$PROGDIR/wget -nH -r -N -nr -l0 -np -X /cgi-bin \
  ftp://ftp.ansto.gov.au/pub/physics/neutron  \
  -P $PUTDIR/ccp14/ftp-mirror/ansto \
  -o $LOGDIR/australia.log 

set DATE=(`date`)
sed "/ANSTO_LHPM_FTP/s/NOT_finished/COMPLETED $DATE/" $LOGDIR/report.txt  > $LOGDIR/report.txt.new
mv $LOGDIR/report.txt.new $LOGDIR/report.txt

This substitutes the relevant "NOT_finished" with "COMPLETED" and the date-time, so that jobs left unfinished due to network/domain problems are relatively obvious.

EUROPE_ONE
Fullprof_PLOTR_FTP                  has COMPLETED Thu Aug 6 01:38:12 BST 1998
PLOTR_Web                           has COMPLETED Thu Aug 6 01:38:28 BST 1998
Simref_Simpro_MEED                  has COMPLETED Thu Aug 6 01:38:47 BST 1998
FIT2D_FTP                           has COMPLETED Thu Aug 6 01:38:49 BST 1998
FIT2D_Web                           has COMPLETED Thu Aug 6 01:41:32 BST 1998
Jana_FTP                            has COMPLETED Thu Aug 6 01:41:55 BST 1998
Jana_Web                            has NOT_finished
Stefan_Krumm_FTP                    has COMPLETED Thu Aug 6 01:42:40 BST 1998
Stefan_Krumm_Web                    has COMPLETED Thu Aug 6 01:43:27 BST 1998
XND_Rietveld_FTP                    has COMPLETED Thu Aug 6 01:43:33 BST 1998
XND_Rietveld_Web                    has COMPLETED Thu Aug 6 01:44:02 BST 1998
Sirware_Web                         has COMPLETED Thu Aug 6 01:45:58 BST 1998
XPMA_Zortep_Web                     has COMPLETED Thu Aug 6 01:46:00 BST 1998
DMPLOT_FTP                          has COMPLETED Thu Aug 6 01:46:13 BST 1998
DIRDIF_FTP                          has COMPLETED Thu Aug 6 01:46:20 BST 1998
CRUNCH_Web                          has COMPLETED Thu Aug 6 01:47:17 BST 1998
DIRDIF_Web                          has COMPLETED Thu Aug 6 01:47:45 BST 1998
DRXWIN_FTP                          has COMPLETED Thu Aug 6 01:47:47 BST 1998
DRXWIN_Web                          has COMPLETED Thu Aug 6 01:48:56 BST 1998
AXES_FTP                            has COMPLETED Thu Aug 6 01:49:02 BST 1998
BGMN_Web                            has COMPLETED Thu Aug 6 01:49:53 BST 1998
ORTEX_Suite_Web                     has COMPLETED Thu Aug 6 01:50:03 BST 1998
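
The logging step, and a quick way to list the jobs that did not finish, can be sketched in POSIX sh (the CCP14 scripts themselves are csh). The function names mark_done and list_unfinished are hypothetical. The sketch assumes that neither the job label nor the timestamp contains a / character, since both are placed inside a sed expression.

```sh
# Replace NOT_finished with "COMPLETED <timestamp>" on the line of report
# file $2 that matches job label $1. $3 optionally overrides the timestamp,
# which otherwise comes from date.
mark_done() {
    stamp=${3:-$(date)}
    sed "/$1/s/NOT_finished/COMPLETED $stamp/" "$2" > "$2.new" &&
        mv "$2.new" "$2"
}

# Print the label (first column) of every job still marked NOT_finished.
list_unfinished() {
    awk '/NOT_finished/ { print $1 }' "$1"
}

# Usage:
#   mark_done ANSTO_LHPM_FTP "$LOGDIR/report.txt"
#   list_unfinished "$LOGDIR/report.txt"
```

Run against the sample report above, list_unfinished would print only Jana_Web.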

All jobs are terminated before 5am (UK time), and the report is then mailed to the operator using the following command in cron:

05 05 * * * /usr/sbin/Mail -s "Mirroring_Results `date`" ccp14@dl.ac.uk < web_area/mirrorbin/logs/report.txt


If you have any queries or comments, please feel free to contact the CCP14