#! /usr/bin/perl

=head1 NAME

crawl-web-for-images.pl


=head1 SYNOPSIS

  # Start with the URLS in seed_urls.txt and write a verbose logfile.
  crawl-web-for-images.pl  --input-file=seed_urls.txt --logfile=getting-images.log --verbose --timestamps

  # Quit after downloading 100 images or after running for about two
  # hours, whichever is sooner.
  crawl-web-for-images.pl -i seed_urls.txt --image-count=100 --timelimit=2h

  # Download about ten images per day until the script is killed.
  crawl-web-for-images.pl -i seed_urls.txt --daily=10 

  # Specify a target directory to download twenty pictures a day to.
  crawl-web-for-images.pl -i seed_urls.txt --dir /home/user/Pictures/random -D 20

  # Only download images if their area in square pixels is at least 1600.
  crawl-web-for-images.pl -i seed_urls.txt --area=1600 

  # Wait an average of an hour between downloads, varying randomly.
  crawl-web-for-images.pl -i seed_urls.txt --wait=1h --random-wait

  # The same, but don't do exponential backoff after errors since our baseline
  # wait is already pretty long.
  crawl-web-for-images.pl -i seed_urls.txt -w 1h -r --noexponential

  # Exit the first time we have a problem downloading or saving anything.
  crawl-web-for-images.pl -i seed_urls.txt --max-errors=1

  # Apply a set of regular expression weights to all URLs found.
  crawl-web-for-images.pl -i seed_urls.txt --weights=weights.tab


=head1 DESCRIPTION

crawl-web-for-images.pl will randomly crawl the web, starting with one
or more specified seed URLs, and download random image files.  It will
generally visit several different sites, gathering image URLs from
<img> tags, before it starts downloading any actual images.
As it gathers more image tag URLs, it will download images relatively more
often and HTML pages relatively less often.  It won't save the HTML
pages to disk, only the images it finds, except in debug mode where
it has some problem parsing the HTML.

It allows you to limit which sites to visit and images to download in several
ways.  However, even with the most carefully constructed seed URL list and
weights, there's no way to be sure that the crawler won't wander into a site with
porn images and download them.  Be sure you're okay with the risk of that before
you run the script.

It honors robots.txt and won't visit pages or download images it's not
allowed to.
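
As a rough illustration of what that check involves, here is a sketch using
C<WWW::RobotRules>, one of the modules listed under DEPENDENCIES.  The
robots.txt text and the example.com URLs are made up, and this is not the
script's own code:

  use strict;
  use warnings;
  use WWW::RobotRules;

  # The agent name is what robots.txt "User-agent" lines are matched against.
  my $rules = WWW::RobotRules->new('crawl-web-for-images.pl');

  # Made-up robots.txt content; the script would fetch the real file.
  my $robots_txt = "User-agent: *\nDisallow: /private/\n";
  $rules->parse( 'http://example.com/robots.txt', $robots_txt );

  for my $url ( 'http://example.com/gallery/1.jpg',
                'http://example.com/private/2.jpg' ) {
      printf "%s: %s\n", $url,
          $rules->allowed($url) ? 'allowed' : 'disallowed';
  }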

=head1 OPTIONS


=head2 Where to get images, where to put them

Any command-line arguments that start with 'http' will be treated
as seed URLs.  Alternatively, you can list seed URLs in a text
file with the command line option:

=over

=item -i  --input-file=[filename]

Points to a file containing a list of seed URLs to start our
web crawl with.  See L</FILES> for more detail.

=item -d  --dir=[directory]

The directory to download image files to.  It's also the default directory for
the logfile if a full path isn't specified for it.  Defaults to the current
directory.

If this doesn't exist yet, the script will try to create it.

=back


=head2 How long to run, how many images to get

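If none of these options are specified, the crawler will sleep for the default
15 minutes between downloads and keep running until it is killed or reaches
the --max-errors limit.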

=over

=item -c --image-count=[integer]

Exit after downloading this many images.

=item -t --timelimit=[integer]([unit])

Exit after running this amount of time.  The time limit can take an integer
argument (i.e. so many seconds), or it can be an integer followed by a
letter specifying the units.  The units allowed for --timelimit are:

  s	seconds (same as leaving units off)
  m	minutes
  h	hours
  d	days

=item -D --daily=[integer]

This argument represents the approximate number of images to download per day.
The amount of time to pause between attempts to download URLs is set based on
this count, weighted by the number of web pages vs. image URLs the robot has in
its lists of URLs.  Early on when it has only web page URLs, it will download
one every few seconds.  Once it has accumulated some image URLs, it will
download less often, and when it has plenty of image URLs, it will spend most of
its time downloading images and little time downloading more web pages, so the
time to sleep between GET requests will approach a limit of 24 hours / number of
images per day.  Assuming no unsuccessful attempts, it will download roughly
this many images in each 24-hour period, and keep running indefinitely unless
you specify an --image-count or --timelimit option.

=item -w  --wait=[integer]([time unit])

This specifies the amount of time to sleep between downloads.
Like the --timelimit, it can take a number of seconds or a number followed
by a time unit (s, m, h or d).

This is ignored if you specify a --daily count, as that will cause the algorithm
to calculate its own sleep time according to the daily image quota and the
changing proportions between the number of web page URLs and image URLs
in memory.

If neither this option nor --daily is specified, the default sleep time is
15 minutes.

=item -r  --random-wait

If this is set, the sleep time will randomly vary between half the time
specified by --wait and one and a half times that long.

=back

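To make the time formats and the --random-wait range concrete, here is a
minimal sketch.  The names C<to_seconds> and C<jittered> are hypothetical and
are not the script's own routines; the sketch only assumes the unit letters
and the half to one-and-a-half range described above:

  use strict;
  use warnings;

  # Convert "90", "15m", "2h" or "1d" to seconds; return -1 if malformed.
  sub to_seconds {
      my ($spec) = @_;
      my %mult = ( s => 1, m => 60, h => 3600, d => 86400 );
      return -1 unless $spec =~ m/^([0-9]+)([smhd]?)$/;
      return $1 * ( $2 eq '' ? 1 : $mult{$2} );
  }

  # Vary a wait uniformly between 0.5 and 1.5 times its nominal length.
  sub jittered {
      my ($wait) = @_;
      return $wait * ( 0.5 + rand() );
  }

  printf "%s is %d seconds\n", $_, to_seconds($_)  for qw(90 15m 2h 1d);
  printf "a one-hour wait became %.0f seconds this time\n", jittered(3600);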

=head2 Error handling

=over

=item -e --max-errors=[integer]

Up until this many nonfatal errors, the script will keep trying to work.  After
this many errors, it will quit.  Types of nonfatal errors include HTTP GET
requests failing.  The default maximum error limit is 5.

=item --exponential

=item --noexponential

Exponential backoff is on by default; it means that after an HTTP
request fails, the script will double its wait time before trying
another such request, and continue doubling the wait time until it
either succeeds in getting a page or reaches the maximum errors and
exits.  Once a request succeeds, the wait time is restored to normal.

=back

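The backoff rule can be sketched as follows.  This is only an illustration:
C<next_wait> is a hypothetical name, the sequence of successes and failures
is made up, and counting errors in total rather than consecutively is an
assumption of the sketch:

  use strict;
  use warnings;

  # A failure doubles the current wait; a success restores the base wait.
  sub next_wait {
      my ( $current, $base, $succeeded ) = @_;
      return $succeeded ? $base : $current * 2;
  }

  my $base       = 900;    # the documented 15-minute default
  my $max_errors = 5;      # the documented --max-errors default
  my ( $wait, $errors ) = ( $base, 0 );

  # 0 = request failed, 1 = request succeeded (made-up outcomes)
  for my $ok ( 0, 0, 1, 0 ) {
      last if !$ok and ++$errors >= $max_errors;
      $wait = next_wait( $wait, $base, $ok );
      print "next wait: $wait seconds\n";
  }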

=head2 Messages and logging

=over

=item -q --quiet

Write nothing to the terminal.

=item -v --verbose

Write extra-detailed messages.

=item -l --logfile=[filename]

Write output to the specified logfile.  Messages will also be written
to the terminal unless --quiet is specified.

=item --debug

Write debug messages, and if there is a problem parsing an HTML file, save the
HTML file to the download directory.

=item -T --timestamps

Include the time and date with each message written to the log file.

=back


=head2 Which images to get

=over

=item --weights=[filename]

Specifies a weights file consisting of regular expressions and weights,
separated by tabs.  If a URL for an image or web page matches one or more of the
regular expressions, the weights will be applied to it in determining how many
copies of the URL are inserted into the list.  A weight of 0 means don't visit
URLs matching that regular expression, 2 means insert two copies into the list,
etc.  If multiple regular expressions match, their weights are multiplied
together.

=item -A  --area=[integer]

Don't download images if we are able to determine their size and their area is
less than this many square pixels.

=item -H  --height=[integer]

Don't download images if we are able to determine their size and their height is
less than this many pixels.

=item -W  --width=[integer]

Don't download images if we are able to determine their size and their width is
less than this many pixels.

=item --or

Download an image if I<either> the --height or the --width criterion is
satisfied.


=back

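How the size options might combine can be sketched like this.  The function
C<looks_too_small> is hypothetical and is not the script's own size test; in
particular, requiring both height and width when --or is absent, and never
rejecting an image whose size is unknown, are assumptions of the sketch:

  use strict;
  use warnings;

  sub looks_too_small {
      my ( $width, $height, %min ) = @_;
      # Unknown dimensions never disqualify an image.
      return 0 unless defined $width and defined $height;
      return 1 if $width * $height < ( $min{area} || 0 );
      my $w_ok = $width  >= ( $min{width}  || 0 );
      my $h_ok = $height >= ( $min{height} || 0 );
      my $ok   = $min{or} ? ( $w_ok || $h_ok ) : ( $w_ok && $h_ok );
      return $ok ? 0 : 1;
  }

  my %min = ( width => 100, height => 100 );
  for my $or ( 0, 1 ) {
      printf "300x40 with --or %s: %s\n", ( $or ? 'on' : 'off' ),
          looks_too_small( 300, 40, %min, or => $or ) ? 'skip' : 'keep';
  }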

=head2 Other options

=over

=item -b  --balancing=[algorithm](lowerbound,upperbound)

This option selects the balancing algorithm that determines whether to visit
another web page looking for URLs or download an image from among the image URLs we've
collected during the main event loop.

The algorithm can be one of three strings: linear, log, or equal, followed by
an optional pair of numbers separated by a comma, the lower bound and upper
bound supplied to said algorithm.  

The default algorithm is log, the default lower bound is 4.61 (log 100) and the default
upper bound is 9.21 (log 10000).  You probably don't want to change this option unless
you've read and understood the relevant code in the main function and set_pausetime().
The default is the default for a reason; I just wanted to have a way to switch
between algorithms during development and testing without having to edit the code.


=item -h --help

Get brief help.

=item -M --manual

Get detailed help.

=back

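For the curious, the interplay between --daily and the default log balancing
can be sketched as below.  It loosely follows what set_pausetime() does, but
C<pause_for> is a hypothetical name, and capping the fraction at 1 is this
sketch's own choice, made so that the pause never exceeds 24 hours divided
by the daily count:

  use strict;
  use warnings;

  sub pause_for {
      my ( $image_urls, $daily, $lower, $upper ) = @_;
      # Fraction of the way from the lower to the upper bound, on a log scale.
      my $fraction = ( log( 1 + $image_urls ) - log($lower) )
                   / ( log($upper) - log($lower) );
      $fraction = 0 if $fraction < 0;
      $fraction = 1 if $fraction > 1;
      my $pause = ( 86400 / $daily ) * $fraction;
      return $pause < 60 ? 60 : $pause;    # never pause less than a minute
  }

  printf "%6d image URLs: pause %.0f seconds\n",
      $_, pause_for( $_, 10, 100, 10000 )
      for ( 0, 99, 999, 9999, 50000 );
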
=cut


=head1 FILES

=head2 Seed URLs file

This can simply be a list of URLs, one per line, but if you want the
crawler to be more likely to visit some of them than others, you can
also specify weights -- an integer separated by whitespace from the
URL.  E.g.,

  http://americangallery.wordpress.com		15
  http://www.americanartarchives.com/artzybasheff.htm	20
  http://goldenagepaintings.blogspot.com/	30

The numbers aren't percentages; they're the number of copies of that URL which
are initially seeded into the list of URLs.  No matter how many copies of a
given URL are seeded, the robot won't visit that URL more than once; the
weight just affects the initial probability of visiting that website rather than
one of the other seed URLs, or (a little later) the probability of visiting that
seed site rather than one of the various URLs gathered from the pages already
visited.

If a line has just a URL with no seed weight, the default weight is 10.

This format also allows comments, beginning with # and continuing to the
end of the line.
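
As a rough sketch of how such a file turns into a weighted list (the name
C<seed_pool> and the example.com URLs are made up, and the comment stripping
here is simpler than the script's own):

  use strict;
  use warnings;

  sub seed_pool {
      my @pool;
      for my $line (@_) {
          ( my $text = $line ) =~ s/#.*//;      # strip comments
          my ( $url, $weight ) = split ' ', $text;
          next unless defined $url;             # blank or comment-only line
          $weight = 10 unless defined $weight;  # documented default weight
          push @pool, ($url) x $weight;         # one entry per unit of weight
      }
      return @pool;
  }

  my @pool = seed_pool(
      "# my favourite galleries",
      "http://example.com/gallery   30",
      "http://example.org/          # default weight",
  );
  printf "%d entries; chance that example.com is picked first: %.2f\n",
      scalar @pool, 30 / @pool;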

=head2 Weights file

If the --weights argument is used, the file it names will be read and
interpreted as a weights file, where each line consists of a regular expression,
a tab, and a weight (a nonnegative number).  If a URL found in a page we visit
matches any of the regular expressions in the weights file, then the
corresponding weight will be applied; 0 means to exclude URLs matching that
regular expression, 0.5 means to give those URLs half the default probability of
being visited, 2 means to give them double the default probability, and so on.
If multiple regular expressions match a single URL, the applied weights are
multiplied. 

(The script basically does this by inserting zero or more copies
of the URL into the list of web pages or images to get, depending on the
weight and a suitable random factor if the weight is not an integer.)  
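
A minimal sketch of that idea follows.  The rules are made up, and
C<combined_weight> and C<copies_for> are hypothetical names rather than the
routines in WeightRandomList.pm:

  use strict;
  use warnings;

  # Made-up rules in the spirit of a weights file: [ regex, weight ].
  my @rules = (
      [ qr/blogspot.*s1600/, 3 ],
      [ qr/\.png$/,          0.5 ],
      [ qr/thumbnail/i,      0 ],
  );

  # Multiply together the weights of every rule that matches.
  sub combined_weight {
      my ($url) = @_;
      my $weight = 1;
      for my $rule (@rules) {
          $weight *= $rule->[1] if $url =~ $rule->[0];
      }
      return $weight;
  }

  # The whole part gives guaranteed copies; the fractional part is the
  # probability of one extra copy.
  sub copies_for {
      my ($weight) = @_;
      my $copies = int $weight;
      $copies++ if rand() < $weight - $copies;
      return $copies;
  }

  my $url = "http://x.blogspot.com/s1600/cat.png";
  my $w   = combined_weight($url);    # 3 * 0.5
  print "weight $w, copies this time: ", copies_for($w), "\n";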

Some example weights:

 ^[a-z]+://[^/]+/?$	10	# give preference to visiting a new site for the first time

 \.webp$	0	# no .webp because Eye of Gnome doesn't support it

 tumblr.*avatar.*png	0
 tumblr\.png	0

 # give higher weight to higher-resolution versions of images on blogspot sites
 blogspot.*s1600	3
 blogspot.*s400	0.5
 blogspot.*s320	0.4

 # block these if they're in the filename (not the domain name)
 # they tend to be banner adverts?
 blogger[^/]*$	0
 yahoo[^/]*$	0
 google[^/]*$	0

 # don't try to edit Wikipedia pages!
 action=edit	0
 action=history	0

 # or submit forms on any site
 submit\?	0
 blogspot\.com/search	0
 delete-comment	0

 (?i)thumbnail	0
 _sm\.(gif|png|jpg)	0
 _small\.(gif|png|jpg)	0


As you can see, I mostly use the weights file to block out undesirable images or
pages that it is not suitable or worthwhile to visit, though in many cases those
would be blocked in the first place by the site's robots.txt, which this script
honors.

See L<WeightRandomList.html> for more information.



=head1 DEPENDENCIES

This script uses the following Perl modules:

C<strict>,
C<warnings>,
C<constant>,
C<Cwd>,
C<Pod::Usage>, and
C<Getopt::Long>
are in the standard library.  

C<LWP::UserAgent>,
C<WWW::RobotRules>, and
C<HTTP::Response>
are available from CPAN.

C<ImageSites.pm> and
C<WeightRandomList.pm>
are included with this script.


=head1 BUGS AND LIMITATIONS

This script doesn't use any image libraries to parse the images downloaded and
make sure they are what the file extension says they are or check the size.
Sometimes you'll get an HTML file (mostly some sort of "image not found" page)
saved as a .png or .jpg or whatever file type, and sometimes you'll get a .png
saved as a .jpg or vice versa because the file extension in the <img> tag was
wrong.

So decisions based on --area, --height and --width use heuristics from the <img> tag
attributes and the filename (e.g., looking for a substring like '200x300'),
not the actual image size.
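
The kind of guessing involved looks roughly like this.  It is a simplified
sketch, not the script's exact logic; C<guess_size> is a hypothetical name
and the attribute strings are made up:

  use strict;
  use warnings;

  # Return (width, height) guessed from IMG attributes, falling back to a
  # "200x300"-style substring in the SRC.  Either value may be undef.
  sub guess_size {
      my ( $attrs, $src ) = @_;
      my ( $w, $h );
      $w = $1 if $attrs =~ m/\bwidth\s*=\s*["']?([0-9]+)/i;
      $h = $1 if $attrs =~ m/\bheight\s*=\s*["']?([0-9]+)/i;
      if (    not defined $w
          and not defined $h
          and $src =~ m/([0-9]+)x([0-9]+)/ )
      {
          ( $w, $h ) = ( $1, $2 );
      }
      return ( $w, $h );
  }

  my @guess = guess_size( 'alt="cat"', 'http://example.com/cat_200x300.jpg' );
  print "guessed $guess[0] by $guess[1]\n";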

The HTML parsing is very basic, just searching for <a href> and <img>
tags with regular expressions.



=head1 AUTHOR

Jim Henry III, L<http://jimhenry.conlang.org/software/>


=head1 LICENSE

This script is free software; you may redistribute it and/or modify it under
the same terms as Perl itself.



=head1 TO DO LIST

Find a suitable image library and check size, file type, etc. before 
saving images.

Use HTML::Parser to find links and image tags more reliably.

Test alt text and title attributes against regexes and apply weights.

Add some sort of whitelist option (another file, or simply a flag to treat the
domain names in the seed URLs file as a whitelist?) or option to look for an
existing blacklist on the web.

=cut

use strict; 
use warnings;

use LWP::UserAgent;
use WWW::RobotRules;
use HTTP::Response;
use Cwd;

use Pod::Usage;
use Getopt::Long;
Getopt::Long::Configure('bundling');

use ImageSites;
use WeightRandomList;

use constant DEFAULT_PAUSE => 900;	# fifteen minutes
use constant DEFAULT_SEED_WEIGHT => 10;
use constant SEC_PER_DAY => 86400;

use constant LOGARITHMIC => 1;
use constant LINEAR => 2;
use constant EQUAL => 3;

my $version = "1.0";

my $balancing_algorithm = LOGARITHMIC;
my $lowerbound = log 100;
my $upperbound = log 10000;
my $balancing_s;

my $starttime = time;
my $pausetime_s;
my $timelimit_s;
my $timelimit = 0;
my $minheight = 0;
my $minwidth = 0;
my $minarea = 0;
my $minima_or = 0;
my $pic_count = 0;
my $max_pic_count = 0;
my $daily_count = 0;

my $timestamps = 0;
my $debug = 0;
my $save_dir;
my $logfile;
my $seed_file;
my $do_exponential_backoff = 1;
my $verbose = 0;
my $quiet = 0;
my $random_wait = 0;
my $max_errs = 5;
my $max_list_size = 50000;
my $weights_file;
my $weights;

# urls of HTML pages and IMG objects which we haven't gotten yet
my @url_list = ();
my @image_list = ();

# Information about IMG tag URLs which we will use when we download them.
# It might make more sense to use a hash of hash references rather than
# two hashes using the same keys (the image URLs).
my %found_in = ();
my %title_text = ();

# these actually list not only those downloaded, but those we've found
# to be disallowed by robots.txt and so refrained from downloading.
# in either case we don't want to re-add them to the lists of things to get
my @page_already_downloaded = ();
my @img_already_downloaded = ();

my %domain_robots_txt = ();
my $script = $0;
$script =~ s(.*/)();
my $robot_rules = WWW::RobotRules->new( $script );

#####


sub help {
	print<<HELP;

$script version $version

Usage:

-q  --quiet                    	Print nothing to the terminal.
-v  --verbose                  	Print extra messages.
-l  --logfile                  	File to print messages to.
-T  --timestamps               	Print timestamp on each message.
-i  --input-file               	File with a list of seed URLs to start from.
-e  --max-errors               	Exit after this many nonfatal errors.
--exponential                  	Do exponential backoff after download failures.
--noexponential                 Don't do exponential backoff after download failures.
-w  --wait                     	Sleep this amount of time between downloads.
-r  --random-wait              	Randomly vary the sleep time between downloads.
--weights                      	Read this file and apply the weights in it.
-d  --dir                      	Target directory to download files into.
-D  --daily                    	Download about this many images per day.
-t  --timelimit                	Exit after this amount of time.
-c  --image-count              	Exit after downloading this many images.
-A  --area                     	Don't download pictures smaller than this.
-H  --height                   	Don't download pictures shorter than this.
-W  --width                    	Don't download pictures narrower than this.
--or                           	Download pictures if either the height or width is okay.
-b  --balancing                	See user manual.
-h  --help                     	Print brief help.
-M --manual                     Show user manual.

See the manual for details of what values are accepted by the --wait and
--timelimit options.

HELP
	
}




sub seed_urls {
    my $fn = shift;
    my $fh;
    open $fh, '<', $fn	or die "couldn't open $fn for reading: $!\n";
    while ( <$fh> ) {
	s/^#.*//;		# remove comment on line by itself:
	s/([^\\])#.*/$1/;	# remove comment following other text
	next if m/^\s*$/;
	s/\s+$//;		# remove trailing whitespace
	my ($url, $weight) = split ' ';	# split ' ' also skips leading whitespace
	if ( not defined $weight ) {
	    $weight = DEFAULT_SEED_WEIGHT; 
	}
	push @url_list, (($url) x $weight);
    }
	
    close $fh;
}

sub allowed_by_robot_rules {
    my $url = shift;
    my $domain_name = shift;
    die if not defined $url;
    if ( not defined $domain_name ) {
	$domain_name = $url;
	$domain_name =~ s!([a-z]+:[0-9]*//[^/]+)/?.*!$1!;
    }

    if ( not defined $domain_robots_txt{ $domain_name } ) {
	my $robots_txt = get_page( "$domain_name/robots.txt" );
	if ( defined $robots_txt ) {
	    $robot_rules->parse( "$domain_name/robots.txt", $robots_txt );
	    $domain_robots_txt{ $domain_name } = 1;
	} else { 
	    $domain_robots_txt{ $domain_name } = 0;
	}
    }

    if ( $domain_robots_txt{ $domain_name } ) {
	return $robot_rules->allowed( $url );
    } else {
	return 1;
    }
}

#####

sub push_by_weight {
    my $list_ref = shift;
    my $str = shift;
    my $default_weight = shift;
    if ( not defined $default_weight ) {
	$default_weight = 1;
    }

    die "internal error" if not defined $str or ref $list_ref ne "ARRAY";

    # if we have no weights hash, just push one instance and return
    if ( not defined $weights or ref $weights ne "HASH" ) {
	writelog "pushing only one copy because we have no weights hash\n" if $debug;
	push @$list_ref, $str;
	return 1;	# callers test the copy count before recording %found_in etc.
    }

    # else push a variable number depending on how many weight regexes
    # match

    my $copies = 0;
    my $working_weight = calc_weight( $str, $weights, $default_weight );
    while ( $working_weight >= 1 ) {
	push @$list_ref, $str;
	--$working_weight;
	++$copies;
    }
    
    if ( $working_weight > 0 && (rand) < $working_weight ) {
	push @$list_ref, $str;
	++$copies;
    }

    writelog qq(pushed $copies copies of $str\n) if $verbose;
    return $copies;
}

#####

sub get_image {
    return if 0 == scalar @image_list;
    writelog "getting an image...\n";

    my $i = int rand scalar @image_list;
    my $img_url = $image_list[ $i ];

    if ( not defined $img_url ) {
	writelog "undefined element in list: $i\'th element of \@image_list\n";
	return;
    } elsif ( $img_url !~ m/[^\s]/ ) {
	writelog "blank element in list: $i\'th element of \@image_list\n";
	return;
    }

    my $allowed = allowed_by_robot_rules( $img_url );
    if ( not $allowed ) {
	writelog "skipping $img_url because it's disallowed by robots.txt\n";
	# don't return just yet, else we'd have to duplicate code for removing
	# urls from list
    }

    my $page_content;
    if ( $allowed ) {
	if ( defined $found_in{ $img_url } ) {
	    writelog "found $img_url on page $found_in{$img_url}\n";
	    delete $found_in{ $img_url };	# drop the entry rather than leave an undef value
	} else {
	    writelog qq("$img_url" not found in \%found_in hash\n);
	}

	$page_content = get_page( $img_url );
	if ( not defined $page_content ) {
	    writelog qq(undefined page content\n);
	    randpause;
	    return;
	}
    }

    my $n0 = scalar @image_list;
    @image_list = grep { $_ ne $img_url } @image_list;
    if ( $verbose ) {
	my $removed_count = $n0 - (scalar @image_list);
	writelog "removed $removed_count instances of $img_url from list\n";
    }

    push @img_already_downloaded, $img_url;

    if ( not $allowed ) {
	return;
    }

    my $filename = $img_url;
    $filename =~ s(.*/)();
    # handles IMG SRC like
    # http://meandthebee.files.wordpress.com/2011/06/pict0907.jpg?w=500&#038;h=375
    $filename =~ s!(.*(png|jpe?g|gif|svg))\?.*!$1!i;

    if ( $title_text{ $img_url } ) {
	writelog qq(prepending alt or title text from IMG tag: "$title_text{$img_url}"\n) if $verbose;
	if ( length( $title_text{ $img_url } ) > 128 ) {
	    writelog "(after truncating alt or title text to 128 chars)\n" if $verbose;
	    $filename = substr( $title_text{ $img_url }, 0, 128) . "__" . $filename;
	} else {
	    $filename = $title_text{ $img_url } . "__" . $filename;
	}
    }

    my $domain_name = $img_url;
    $domain_name =~ s![a-z]+:[0-9]*//([^/]+)/?.*!$1!;

    # some websites have image filenames that are just a small number,
    #  and no alt/title text.  those tend to conflict with similar
    #  filenames from other websites.  so prepend the domain name.

    if ( $filename =~ m/^[0-9]+\.(jpe?g|gif|svg|png)$/ ) {
	$filename = $domain_name . "__" . $filename;
	writelog qq(filename consists solely of digits so prepending domain name: $filename\n) if $verbose;
    }

    # also prepend the domain name if we're not on tumblr and we're downloading a file
    # named 'tumblr_[long series of hex digits]_[baseten digits].[fileextension]'
    if ( $domain_name !~ m/^tumblr\.com/   # the base site, not a user's blog like beautifulcentury.tumblr.com
	 and  $filename =~ m/tumblr_ [[:xdigit:]]+ _ [0-9]+ \. (jpe?g|gif|svg|png)$/x )
    {
	$filename =~ s/tumblr_/$domain_name . "__"/xe;
	writelog qq(filename is 'tumblr_' plus hex digits, so switching to domain name: $filename\n) if $verbose;
    }    

    # decode percent-escapes; the hex digits may be upper- or lowercase
    $filename =~ s!%([0-9A-F][0-9A-F])!chr(hex($1))!egi;
    if ( $filename =~ m/%[0-9A-F][0-9A-F]/i ) {
	writelog qq(mis-double-encoded URL: $filename\n);
	$filename =~ s!%([0-9A-F][0-9A-F])!chr(hex($1))!egi;
    }

    #$filename =~ s![^A-Za-z0-9_.-]!_!g;
    # turn spacing and certain punctuation into underscores
    $filename =~ s![#*/\\?|<>:\s]!_!g;
    save_image( $filename, $page_content );
    if ( $max_pic_count  &&  ++$pic_count >= $max_pic_count ) {
	writelog qq(exiting because we have downloaded $pic_count images\n);
	exit;
    }

    randpause;
    return;
}

#####

my $consecutive_linkless_pages = 0;

sub get_html_page {
    writelog "getting an html page...\n";

    my $i = int rand scalar @url_list;
    my $url_this_page = $url_list[ $i ];
    if ( not defined $url_this_page ) {
	writelog "undefined element in list: $i\'th element of \@url_list\n";
	return;
    } elsif ( $url_this_page !~ m/[^\s]/ ) {
	writelog "blank element in list: $i\'th element of \@url_list\n";
	return;
    }

    my $base_url = base_directory( $url_this_page );
    my $domain_name = domain_name( $url_this_page );

    my $allowed = allowed_by_robot_rules( $url_this_page, $domain_name );
    if ( not $allowed ) {
	writelog "skipping $url_this_page because it's disallowed by robots.txt\n";
	# don't return just yet, else we'd have to duplicate code for removing
	# urls from list
    }

    my $page_content;
    my $redirected_url;
    
    if ( $allowed ) {
	( $page_content, $redirected_url ) = get_page( $url_this_page );
	if ( not defined $page_content ) {
	    writelog qq(undefined page content\n);
	    randpause;
	    return;
	}
    }

    # we've downloaded the page (or decided not to); now remove all
    # instances of that URL from the list
    my $n0 = scalar @url_list;
    @url_list = grep { $_ ne $url_this_page } @url_list;
    if ( $verbose ) {
	my $removed_count = $n0 - (scalar @url_list);
	writelog "removed $removed_count instances of $url_this_page from list\n";
    }

    if ( $redirected_url ) {
	writelog "substituting $redirected_url for $url_this_page\n"   if $verbose;
	$url_this_page = $redirected_url;
	$base_url = base_directory( $url_this_page );
	$domain_name = domain_name( $url_this_page );

	if ( $verbose ) {
	    writelog "base_directory() returned " . ( defined $base_url ? $base_url : "undefined" ) . "\n";
	    writelog "domain_name() returned " . ( defined $domain_name ? $domain_name : "undefined" ) . "\n";
	}
	die if not defined $base_url or not defined $domain_name;
    }
    
    push @page_already_downloaded, $url_this_page;

    return unless $allowed;

    my @urls_this_page;
    my @images_this_page;

    while ( $page_content =~ m/<(a|area)\s+[^>]*href\s*=\s*"([^"]+)"/gi ) {
	writelog qq(found double-quoted url: $2\n) 	if $verbose;
	push @urls_this_page, $2;
    }
    while ( $page_content =~ m/<(a|area)\s+[^>]*href\s*=\s*'([^']+)'/gi ) {
	writelog qq(found single-quoted url: $2\n)	if $verbose;
	push @urls_this_page, $2;
    }

    while ( $page_content =~ m/<frame\s+[^>]*src\s*=\s*"([^"]+)"/gi ) {
	writelog qq(found double-quoted frame url: $1\n) 	if $verbose;
	push @urls_this_page, $1;
    }
    while ( $page_content =~ m/<frame\s+[^>]*src\s*=\s*'([^']+)'/gi ) {
	writelog qq(found single-quoted frame url: $1\n) 	if $verbose;
	push @urls_this_page, $1;
    }

    while ( $page_content =~ m/<(a|area) \s+ [^>]* href \s* = \s* ([^'"\s>]+)/gix ) {
	writelog qq(found unquoted url: $2\n) 	if $verbose;
	push @urls_this_page, $2;
    }


    while ( $page_content =~ m/<img ([^>]+) src \s* = \s* ['"] ([^'"]+) ['"] ([^>]*)>/gix ) {
	my $other_attributes = $1 . ' ' . $3;
	my $src = $2;
	writelog qq(found image tag, src = "$src", other attributes = "$other_attributes"\n)	if $verbose;

	####TODO check alt and title attributes against string weights

	my $title;
	if ( $other_attributes =~ m/alt\s*=\s*"([^"]+)"/i
	     or $other_attributes =~ m(alt\s*=\s*'([^']+)')i ) 
	{
	    $title = $1;
	    writelog qq(found alt text "$title"\n)  if $verbose;
	}

	if ( $other_attributes =~ m/title\s*=\s*"([^"]+)"/i
	     or $other_attributes =~ m/title\s*=\s*'([^']+)'/i ) 
	{
	    $title = $1;
	    writelog qq(found title text "$title"\n)	 if $verbose;
	}

	my ($height, $width);
	if ( $other_attributes =~ m/height\s*=\s*["']?([0-9]+)["']?/i ) {
	    $height = $1;
	}
	if ( $other_attributes =~ m/width\s*=\s*["']?([0-9]+)["']?/i ) {
	    $width = $1;
	}
	# style="width: 16px; height: 16px; ......"
	if ( $other_attributes =~ m/style.*width:\s*([0-9]+)px/i ) {
	    $width = $1;
	}

	if ( $other_attributes =~ m/style.*height:\s*([0-9]+)px/i ) {
	    $height = $1;
	}

	if ( not defined $height and $src =~ m/([0-9]+)px/ ) {
	    $height = $width = $1;
	    writelog "got width/height from " . $height . "px substring in SRC $src\n"	if $verbose;
	}

	if ( not defined $height and $src =~ m/([0-9]+)x([0-9]+)/ ) {
	    $width = $1;
	    $height = $2; 
	    writelog "got width/height from " . $width . "x" . $height . " substring in SRC $src\n"	if $verbose;
	}


	if ( not &too_small( $width, $height ) ) 
	{
	    if ( $src !~ m(^[a-z]+://) ) {
		writelog qq($src is not a full URL\n)	if $verbose;
		if ( $src =~ m!^/! ) {
		    writelog qq(prepending $domain_name to image source $src\n)	if $verbose;
		    $src = $domain_name . $src;
		} else {
		    writelog qq(prepending $base_url to image source $src\n)	if $verbose;
		    $src = $base_url . $src;
		}
	    }

	    if ( push_by_weight \@image_list, $src ) {
		$found_in{ $src } = $url_this_page;
		if ( $title ) {
		    $title_text{ $src } = $title;
		}
	    }
	}
    }

    if ( 0 == scalar @urls_this_page ) {
	if ( $debug ) {
	    # let us test for false negatives, <a href> tags that somehow
	    # a buggy regex failed to recognize
	    writelog qq(no links found in $url_this_page, saving page content for debugging\n);
	    my $filename = $url_this_page;
	    $filename =~ s(^[a-z]+://)();
	    $filename =~ s/[^A-Za-z0-9._-]/_/g;
	    if ( not $filename =~ m/html?$/i ) {
		$filename .= ".html";
	    }
	    savepage $filename, $page_content;
	    writelog qq(saved "$filename"\n);
	}

	# this tends to happen when we're on a wireless network on
	# which one has to connect to the hotel/restaurant website and
	# click agree on some license agreement before going anywhere
	# else.  programs other than the web browser tend not to work.
	# instead of telling us we can't connect, it gives us a bogus
	# empty page.  (e.g. Holiday Inn Express in Gastonia NC)

	if ( ++$consecutive_linkless_pages > 10 ) {
	    writelog qq(exiting because we have found more than 10 consecutive pages with no links, indicating a persistent connectivity problem\n);
	    exit;
	}	    
	randpause;
	return;
    } else {
	$consecutive_linkless_pages = 0;
    }

    foreach ( @urls_this_page ) {
	my $url = $_;
	
	# if it's an anchor within this same page, skip it
	# or if it's a javascript command, or email address
	if ( $url =~ m/^#/ or $url =~ m/^javascript/i  or $url =~ m/^mailto/i ) {
	    next;
	}

	if ( $url !~ m(^[a-z]+://) ) {
	    writelog qq($url is not a full URL\n)	if $verbose;
	    if ( $url =~ m!^/! ) {
		writelog qq(prepending $domain_name to $url\n)	if $verbose;
		$url = $domain_name . $url;
	    } else {
		writelog qq(prepending $base_url to $url\n)	if $verbose;
		$url = $base_url . $url;
	    }
	}

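	# don't re-add anything we've already fetched (or found disallowed)
	if ( grep { $_ eq $url } @img_already_downloaded, @page_already_downloaded ) {
	    writelog qq($url already downloaded earlier\n)  if $verbose;
	    next;
	}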

	if ( $url =~ m/(jpe?g|png|gif|svg)$/i ) {
	    if ( push_by_weight \@image_list, $url ) {
		$found_in{ $url } = $url_this_page;
	    }
	} elsif ( $url =~ m/(mp3|avi|m4a)$/i ) {
	    writelog qq(discarding $url because it's media we don't want\n); #  if $verbose;
	} else {
	    push_by_weight \@url_list, $url;
	}
    }
    randpause;
    return;
}

#####

sub reduce_list {

    my @reduced = @_;
    
    # hysteresis; don't go through this reduction every time we
    # accumulate a few more items.  So delete a random 10% of the items.

    while ( scalar @reduced > ($max_list_size * 0.9) ) {
	splice @reduced, (int rand scalar @reduced), 1;
    }

    writelog qq(after reducing, list has ) . scalar @reduced . qq( elements\n);

    return @reduced;
}

######

sub set_pausetime {
    return if $daily_count < 1;

    my $image_prob = 0;
#    my $html_prob = 0;
    my $tot = $upperbound - $lowerbound;
    my $basic_pause =  SEC_PER_DAY / $daily_count;

    if ( $balancing_algorithm == LOGARITHMIC ) { 
        #(($lowerbound + rand($upperbound - $lowerbound)) < (log (1 +
        #scalar @image_list)))
	$image_prob = ((log (1 + scalar @image_list)) - $lowerbound) / $tot;
	if ( $image_prob < 0 ) {
	    $image_prob = 0;
	}

#	$html_prob = ($upperbound - (log (1 + scalar @image_list))) / $tot;
#	if ( $html_prob < 0 ) {
#	    $html_prob = 0;
#	}
    } elsif  ( $balancing_algorithm == LINEAR ) { 
	#if (($lowerbound + rand ($upperbound - $lowerbound)) < scalar @image_list) {
	$image_prob = ((scalar @image_list) - $lowerbound) / $tot;
	if ( $image_prob < 0 ) {
	    $image_prob = 0;
	}

#	$html_prob = ($upperbound - (scalar @image_list)) / $tot;
#	if ( $html_prob < 0 ) {
#	    $html_prob = 0;
#	}

    } elsif ( $balancing_algorithm == EQUAL ) {
	$image_prob = 0.5;
    } else {
	die "unimplemented";
    }

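    # cap the probability so the pause never exceeds the documented limit
    # of SEC_PER_DAY / $daily_count, even when the image list is very long
    $image_prob = 1 if $image_prob > 1;

    my $pausetime = ( $basic_pause * $image_prob );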
    writelog "setting pause time of $pausetime sec based on daily image count $daily_count and image download probability of $image_prob\n"	if $verbose;
    if ( $pausetime < 60 ) {
	writelog "setting minimum pause time of 60 sec\n"	if $verbose;
	$pausetime = 60;
    }
    &set_config_var( 'pausetime' => $pausetime );
}


########## begin main function


my $help;
my $manual;

if ( not GetOptions(
    	'q|quiet' 	=> \$quiet,
    	'v|verbose' 	=> \$verbose,
    	'l|logfile=s' 	=> \$logfile,
    	'i|input-file=s' 	=> \$seed_file,
	'T|timestamps' => \$timestamps,
	'e|max-errors=i'	=> \$max_errs,
	'exponential!'	=> \$do_exponential_backoff,
	'w|wait=s' 	=> \$pausetime_s,
	'weights=s' 	=> \$weights_file,
	'r|random-wait!'	=> \$random_wait,
    	'd|dir=s' 	=> \$save_dir,
	'D|daily=i'	=> \$daily_count,
	't|timelimit=s'  => \$timelimit_s,
	'c|image-count=i' 	=> \$max_pic_count,
	'A|area=i' 	=> \$minarea,
	'H|height=i' 	=> \$minheight,
	'W|width=i' 	=> \$minwidth,
	'or' 		=> \$minima_or,
	'b|balancing=s' => \$balancing_s,
	'h|help'       => \$help,
	'M|manual'     => \$manual,
	'debug'        => \$debug,	# documented in the manual under --debug
    ) ) 
{
    help;
    exit(1);
}

if ( $help ) {
    help;
    exit(0);
} elsif ( $manual ) {
    pod2usage(-verbose => 2);
    exit;
}


# note that debug messages from the WeightRandomList library will go to
# stdout, not to our log file
WeightRandomList::set_debug( $debug );

my $pausetime = DEFAULT_PAUSE;
if ( $pausetime_s ) {
    $pausetime = &parsetime( $pausetime_s );
    if ( $pausetime < 0 ) {
	die qq(invalid format for wait option "$pausetime_s"\n);
    }
}

if ( $timelimit_s ) {
    $timelimit = &parsetime( $timelimit_s );
    if ( $timelimit < 0 ) {
	die qq(invalid format for timelimit option "$timelimit_s"\n);
    }
}

if ( defined $seed_file ) {
    if ( -T $seed_file ) {
	seed_urls( $seed_file );
    } else {
	die qq("$seed_file" does not exist or is not a text file\n);
    }
}

if ( defined $weights_file ) {
    if ( -T $weights_file ) {
	$weights = weights_from_file( $weights_file );
    } else {
	die qq("$weights_file" does not exist or is not a text file\n);
    }
}

if ( $balancing_s ) {
#    if ( $balancing_s =~ m/^(linear\(([0-9]+),([0-9]+)\)|log(arithmic)?\(([0-9]+),([0-9]+)\)|equal)$/ ) {
    if ( $balancing_s =~ m/^(linear|log|equal)(.*)/ ) {
	my $algorithm = $1;
	my $arg = $2;
	if ( $algorithm eq "linear" ) {
	    $balancing_algorithm = LINEAR;
	} elsif  ( $algorithm eq "log" ) {
	    $balancing_algorithm = LOGARITHMIC;
	} elsif ( $algorithm eq "equal" ) {
	    $balancing_algorithm = EQUAL;
	} else {
	    # this branch shouldn't be reached unless we have a bug
	    die qq(internal error: unrecognized balancing algorithm "$algorithm"\n);
	}

	if ( $arg =~ m/([.0-9]+)\s*,\s*([.0-9]+)/ ) {
	    $lowerbound = $1;
	    $upperbound = $2;
	    if ( $lowerbound >= $upperbound or $lowerbound < 1 or $upperbound < 1 ) {
		die "lower bound must be less than upper bound and both must be at least 1\n";
	    }
	    if ( $balancing_algorithm == LOGARITHMIC ) {
		# use the validated values rather than the regex capture variables
		$lowerbound = log $lowerbound;
		$upperbound = log $upperbound;
	    }
	}
    } else {
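	# report the error and the usage text together on stderr, then exit
	pod2usage( { -message => "Invalid format for --balancing option",
		     -verbose => 1, -exitval => 1, -output => \*STDERR } );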
    }
}

if ( $save_dir ) {
    if ( not -d $save_dir ) {
	warn "$save_dir doesn't exist, trying to create it...\n";
	mkdir $save_dir 	or die "creating $save_dir failed: $!\n";
    }
} else {
    $save_dir = getcwd;
}

image_library_init(
    'scriptname' 	=> $script,
    'logfile' 		=> $logfile,
    'debug' 		=> $debug,
    'timestamps' 	=> $timestamps,
    'max_errs' 		=> $max_errs,
    'random_wait'	=> $random_wait,
    'do_exponential_backoff' => $do_exponential_backoff,
    'verbose'		=> $verbose,
    'pausetime'		=> $pausetime,
    'minheight'		=> $minheight,
    'minwidth'		=> $minwidth,
    'minarea'		=> $minarea,
    'minima_or'		=> $minima_or,
    'save_dir' 		=> $save_dir,
    );

if ( $debug and $balancing_s ) {
    writelog "balancing algorithm = $balancing_algorithm, lower bound = $lowerbound, upper bound = $upperbound\n";
}

while ( my $arg = shift ) {
    if ( $arg =~ m{^https?://}i ) {
	push_by_weight \@url_list, $arg, DEFAULT_SEED_WEIGHT;
    } else {
	writelog qq("$arg" doesn't look like a URL\n);
    }
}

if ( scalar @url_list < 1 ) {
    die "No URLs given on command line or in --input-file\n";
}


while ( 1 ) {
    writelog qq(we have ) . scalar @image_list . qq( image URLs and ) 
	. scalar @url_list . qq( other URLs in our lists\n);

    if ( scalar @url_list < 1 ) {
	# This should not happen except when a persistent connectivity
	# problem makes us discard URLs as bad instead of retrying them
	# later.  It should be rarer now that get_html_page() exits after
	# downloading ten consecutive pages with no URLs in them, but
	# other problems may still be able to cause it.

	writelog qq(exiting because we have exhausted our list of URLs\n);
	exit;
    }

    if ( $timelimit > 0 && ( time - $starttime ) > $timelimit ) {
	writelog qq(exiting because time limit has passed\n);
	exit;
    }

    if ( scalar @image_list > $max_list_size ) {
	writelog qq(reducing image list\n);
	@image_list = reduce_list @image_list;
    }

    if ( scalar @url_list > $max_list_size ) {
	writelog qq(reducing URL list\n);
	@url_list = reduce_list @url_list;
    }

    set_pausetime;

    if ( $balancing_algorithm == LOGARITHMIC ) { 
	if (($lowerbound + rand($upperbound - $lowerbound)) < (log (1 + scalar @image_list))) {
	    get_image;
	} else {
	    get_html_page;
	}
    } elsif  ( $balancing_algorithm == LINEAR ) { 
	if (($lowerbound + rand ($upperbound - $lowerbound)) < scalar @image_list) {
	    get_image;
	} else {
	    get_html_page;
	}
    } elsif ( $balancing_algorithm == EQUAL ) {
	if ( scalar @image_list == 0  or  rand() < 0.5 ) {
	    get_html_page;
	} else {
	    get_image;
	}
    } else {
	die "unimplemented";
    }
}


