#! /usr/bin/perl -w

=head1 NAME

textual-slideshow.pl -- slowly scroll random paragraphs from various files


=head1 SYNOPSIS

  # Do a slideshow based on .txt files in the $HOME directory.
  # (Or the current directory if there is no $HOME environment variable.)
  textual-slideshow.pl

  # Run on the $HOME/homepage directory and print paragraphs only from html files.
  textual-slideshow.pl --extensions html $HOME/homepage

  # Get txt and html files from both $HOME/ebooks and $HOME/homepage
  textual-slideshow.pl --extensions txt,html $HOME/ebooks $HOME/homepage

  # Get all text files regardless of extension
  textual-slideshow.pl --type 

  # Scroll text faster (sleep a shorter amount of time between lines):
  textual-slideshow.pl --sleep 0.02 $HOME/ebooks

  # Scroll text slower:
  textual-slideshow.pl --sleep 0.1 $HOME/ebooks

  # Wrap text to fit the terminal window, if a given paragraph doesn't look
  # like poetry, code, etc.
  textual-slideshow.pl --wrap $HOME/ebooks

  # Wrap text to a 60-character margin regardless of the terminal window size.
  textual-slideshow.pl --wrap --margin 60 $HOME/ebooks

  # Print filename and line number of the source file before printing a paragraph.
  textual-slideshow.pl --print-filenames $HOME/ebooks


=head1 DESCRIPTION

This program takes a list of text files or directories containing text files,
then randomly displays paragraphs from those files on the terminal, one line at a
time, with a short pause between each line.  It's interactive, with various keystrokes
speeding up or slowing down the scrolling, getting help, exiting or doing other
things.

You can run this on your home directory and see what output you get, or you can
customize the output by giving it particular subdirectories, supplying a weights
file to make files whose paths match certain patterns more or less likely to be
chosen, setting the file extensions it looks for, etc.

This will probably produce more interesting output if you have a lot of 
HTML or plain text ebooks on your hard drive.  A good source of them is
L<Project Gutenberg|https://www.gutenberg.org/>.  In a future version I plan
to have this script grab paragraphs from .epub files as well.

While the slideshow is running, press h for help, q for quit.


=head1 COMMAND LINE OPTIONS

Usage: textual-slideshow.pl [options] [filenames and/or directory names]

=over

=item -s --sleep=[number]		

Sleep time (seconds to sleep for each character output; default
is 0.045 seconds per character).

=item -f --print-filenames	

Print filename and line number of each paragraph

=item -w --wrap		

Re-wrap paragraphs to fit the terminal window

=item -W --force-wrap	

Re-wrap everything, even if it looks like source code, poetry etc.

=item -m --margin=[number]

Margin (integer); overrides terminal window width for rewrap

=item -t --type

Identify files with Perl -T filetest instead of file extension

=item -e --extensions=[string]

Look for specified list of extensions (comma-separated) instead of just .txt

=item -l  --min-length=[number]    

Don't print paragraphs shorter than this (in characters).

=item -L  --max-length=[number]    

Don't print paragraphs longer than this (in characters).

=item -p --max-paragraphs=[number]

Maximum number of random paragraphs to store in memory at once

=item --preload

Start loading and printing paragraphs before we finish collecting the list of
filenames. (Use this if you're giving it a huge directory hierarchy and the
long startup time is annoying.)

=item --weights=filename	

Specify a file of regular expression/weight pairs to determine how likely
filenames matching various regexes are to be used.  See L</WEIGHTS> section.

=item -h --help

Get brief help.

=item -M --manual

Get detailed help (this manual).

=back

=head2 Command line options used for testing

These options are probably only useful if you're hacking the script and adding
features, etc.

=over

=item -d --debug

Turn on debug messages.

=item -S --startup-test

Exit after doing startup, before starting the main print-and-check-keystroke
loop.  Probably only useful along with --debug.

=item -U --utf8-test

Execute C<test_utf8_handling()>, which tests utf8 code by reading, parsing, decoding
and outputting some test files.

=back


=head1 INTERACTIVE COMMANDS

While the slideshow is running, you can press certain keys to change its
behavior or exit.

=over

=item h, ?

Display help.

=item +

Speed up scrolling.

=item -

Slow down scrolling.

=item f

Turn on/off printing of filenames and line numbers.

=item w

Turn on/off wrapping of text to fit terminal window.

=item d

Turn on/off debug mode.

=item [spacebar]

Pause (any key resumes).

=item q, x, ESC

Quit.	

=back


=head1 WEIGHTS

If the --weights argument is used, the following filename will be read and
interpreted as a weights file, where each line consists of a regular expression,
a tab, and a weight (a nonnegative number).  If a filename (including the full
path) matches any of the regular expressions in the weights file, then the
corresponding weight will be applied; 0 means to exclude files matching that
regular expression, 0.5 means to give those files half the default probability
of being chosen to print paragraphs from, 2 means to give them double the
default probability.  If multiple regular expressions match a single filename,
the applied weights are multiplied.  (The default probability is normally
1/n where n = the number of text files found in the target directories that
match the criteria given on the command line, but once you start applying
weights it gets a bit tricky.)

I mostly use this to block out certain directories that contain log files, etc., and
also to make sure that the Project Gutenberg directory (which contains the vast
majority of the plain text files on my hard drive) doesn't overly dominate the
output.

Some example weights:

  /packagelist	0
  \.log$	0
  etext	1.5
  Documents/fic	1.5
  etext/gutenberg_2001	0.67

These weights will exclude any text files found in the packagelist directory and
any .log files, wherever they're found.  (I have a C<cron> job that saves a list
of the installed packages every day, to make it easier to get back to a good
state after an OS reinstall or upgrade.)  It will give extra weight to files in
the etext and Documents/fic directories, and reduced weight to the Project
Gutenberg directory (as mentioned above, to keep it from overwhelming the output).
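
As a rough sketch of the multiplication rule (an illustration only, not the
code in C<WeightRandomList.pm>; the C<@rules> list and the C<weight_for()>
helper are hypothetical names made up for this example, and it assumes the
regex/weight pairs have already been read from the weights file):

  use strict;
  use warnings;

  # Hypothetical rule list: [ regex, weight ] pairs
  my @rules = (
      [ qr{/packagelist},         0    ],
      [ qr/\.log$/,               0    ],
      [ qr{etext},                1.5  ],
      [ qr{etext/gutenberg_2001}, 0.67 ],
  );

  # Multiply together the weights of every rule whose regex matches the path.
  sub weight_for {
      my $path   = shift;
      my $weight = 1;    # weight of a file that matches no rule
      for my $rule ( @rules ) {
          my ( $regex, $w ) = @$rule;
          $weight *= $w   if $path =~ $regex;
      }
      return $weight;
  }

  # This path matches both "etext" and "etext/gutenberg_2001"
  printf "%.3f\n", weight_for( "/home/me/etext/gutenberg_2001/book.txt" );

Under these assumptions, a file in etext/gutenberg_2001 gets 1.5 * 0.67, or
about 1.0, while other files under etext keep the full 1.5.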

See L<WeightRandomList.html> for more information.


=head1 DEPENDENCIES

This script uses the following Perl modules:

C<Encode>, C<Encode::Guess>, C<Getopt::Long>, C<Pod::Usage>, C<Term::ANSIColor>,
C<Text::Wrap>, and C<Time::HiRes> are in the standard library.

C<HTML::Parser> and C<Term::ReadKey> are available from
CPAN.  

C<WeightRandomList.pm> is included with this distribution.


=head1 AUTHOR

Jim Henry III, L<http://jimhenry.conlang.org/software/>


=head1  ACKNOWLEDGMENTS

Thanks to people at perlmonks.org for help.
L<http://perlmonks.org/index.pl?node_id=869446>


=head1 LICENSE

This script is free software; you may redistribute it and/or modify it under
the same terms as Perl itself.


=head1 BUGS and LIMITATIONS

Currently assumes the input files are in one of ASCII, Latin-1, CP437 (the
old IBM PC charset), UTF-8 or UTF-16.  Will probably garble the output if any of
the input files are in some other format.

Paragraphs taken from HTML files are not associated with a line number,
only a filename.

There is no way to get higher-verbosity debug messages except
editing the C<my $debug = 0;> statement to set it to a higher
value.

Symbolic links found in the target directories are ignored, as are all other
things that aren't regular files (devices, etc.) or subdirectories.

=cut


=head1 TO DO LIST

Make the algorithm that determines how likely we are to save a particular
paragraph from a particular file depend on configurable variables instead
of a constant.

Use Archive::Zip and HTML::Parser to get paragraphs from epubs as well.

Add command line option to suppress ANSI colors in paragraphs from
HTML files.  Maybe also option to customize which ANSI codes are
associated with which tags?

Fancier display: use Term::ANSIColor to let the user specify colors
and text decoration for various HTML tags via names instead of hard-coding
one for the <em> family and one for <strong>.

Or take full control of screen, as graphical window, and print each paragraph in
a different font?  Maybe figure out how to make it a screensaver plugin for
GNOME and other desktops?

=cut

use Term::ReadKey;
use Term::ANSIColor;
use Text::Wrap;
use Time::HiRes qw(time);
use strict;
use warnings;
use Getopt::Long;
#use Encode qw(encode decode);
use Encode;
use utf8;
use HTML::Parser;
use WeightRandomList;
use Pod::Usage;


# cp437 is the old IBM PC character set, which is used in a 
# lot of Gutenberg etexts from the 1990s
#use Encode::Guess qw( iso-8859-1 cp437 );
use Encode::Guess qw( iso-8859-1 );

#use open qw( :std :encoding(UTF-8) );

# proportion between file size and probability that a given
# paragraph will be saved for later display.  the smaller this number,
# the fewer paragraphs from each file are saved.
use constant SAVE_RATE => 2000;

# read paras at a fast rate till we have this many, then slow down a bit
use constant TARGET_INIT_PARAS => 5000;

# user pressing + or - will speed up or slow down by 25%, dividing
# or multiplying the sleep-time per character output by this amount
use constant PAUSE_INCR_RATIO => 1.25;
use constant DEFAULT_MARGIN => 78;

# config variables
my $debug = 0;
my $check_type = 0;
my $check_extensions;
my $print_filenames = 0;
my $wrapping = 0;	# if true, rewrap paragraphs that don't look specially formatted
my $rewrap_margin = 0;	# override terminal width with this wrap margin
my $force_wrap = 0;	# if true, wrap everything, even poetry or other formatted text
my $preload_paras = 0;
my $pause_len = 0.045;  	# seconds per character of output
my $max_paras = 100000;
my $html_extensions = qr(\.x?html?$);

my $err_count = 0;
my $max_errs = 3;

my $weights_file = '';
my $weights;

my @file_extensions = ();

# these two parallel arrays are logically one data structure.  before I started
# using HTML::Parser, I had a bug with HTML processing where they got out of
# sync, which wouldn't have happened if they were one data structure.

# it might make sense to replace them with a hash where the keys are filename
# and line number concatenated and the values are paragraphs?
# but that wouldn't work with html files, where we don't have line numbers.
# all the paragraphs for one html file would be under one key.
# so maybe one array of hashes, each with two keys and two values?

my @paras;
my @indices;

my @filenames;

#  Supporting CLICOLOR
#  
#  https://bixense.com/clicolors/ proposes a standard for enabling and disabling
#  color output from console commands using two environment variables, CLICOLOR
#  and CLICOLOR_FORCE. Term::ANSIColor cannot automatically support this
#  standard, since the correct action depends on where the output is going and
#  Term::ANSIColor may be used in a context where colors should always be
#  generated even if CLICOLOR is set in the environment. But you can use the
#  supported environment variable ANSI_COLORS_DISABLED to implement CLICOLOR in
#  your own programs with code like this:
#  
#  if (exists($ENV{CLICOLOR}) && $ENV{CLICOLOR} == 0) {
#      if (!$ENV{CLICOLOR_FORCE}) {
#          $ENV{ANSI_COLORS_DISABLED} = 1;
#      }
#  }


####TODO make these configurable on the command line
my $em_color = 'underline'; # 'bright_green underline';
my $strong_color = 'bold';  # 'bright_magenta bold';

use constant NO_MAX => -1;
my $min_length = 0;  # minimum paragraph length in characters (-l --min-length)
my $max_length = NO_MAX;

=head1 FUNCTIONS

=head2 main function

Initialize variables based on command line options, initialize our list of
files, then start the main event loop.

=cut

my $starttime = time;
# these four variables are only used in the main function
my $help = 0;
my $manual = 0;
my $startup_test = 0;
my $utf8_test = 0;

Getopt::Long::Configure('bundling');

srand;

my $go_rc = GetOptions(
    's|sleep=f' => \$pause_len,
    'f|print-filenames' => \$print_filenames,
    'w|wrap' => \$wrapping,
    'W|force-wrap' => \$force_wrap,
    'm|margin=i' => \$rewrap_margin,
    't|type' => \$check_type,
    'e|extensions=s' => \$check_extensions,
    'weights=s' 	=> \$weights_file,
    'p|max-paragraphs=i' => \$max_paras,
    'l|min-length=i'       => \$min_length,
    'L|max-length=i'       => \$max_length,
    'preload' => \$preload_paras,
    'h|help' => \$help,
    'M|manual' => \$manual,
    'd|debug' => \$debug,
    'S|startup-test' => \$startup_test,
    'U|utf8-test'     => \$utf8_test,
);

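if ( not $go_rc ) {
    # GetOptions() has already printed a message about the bad option,
    # so just point to the help and exit with a failure status
    print STDERR "Try '$0 --help' for a list of options.\n";
    exit(2);
} elsif ( $help ) {
    display_usage();   # prints the usage message and exits
} elsif ( $manual ) {
    pod2usage(-verbose => 2);
    exit;
}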

validate_options();

if ( $force_wrap or $rewrap_margin ) {
    $wrapping = 1;
}


ReadMode 3; # 'cbreak': read single keystrokes without echoing them to stdout (gets reset in END block)
binmode(STDOUT, ":encoding(UTF-8)");
test_utf8_handling()    if $utf8_test;

# any remaining arguments are filenames or directory names
build_file_list();
apply_weights()         if $weights_file;
exit(0)                 if $startup_test;
main_event_loop();

#==================================================

END {
    print color('reset');
    ReadMode 0;
}


=head2 test_utf8_handling()

Used only in testing; reads a couple of simple files and tests whether their
high-bit characters look right when printed.  (Gets called when the -U option
is set.)

=cut

sub test_utf8_handling {
    $| = 1;  # auto-flush stdout
    for my $f ( qw( test/test-ascii-with-entities.html test/test-utf8-with-entities.html
		 /home/jim/etext/saved-websites/michael_repton/zarathustra47/Atragam/LPILexicon.html ) ) {
	my @p =  paragraphs_from_html_file( "$f" );
	printf "filename: %s number of paragraphs: %d\n\n", $f, scalar @p;
	foreach ( @p ) {
	    print $_ . "\n\n";
	}
    }
    my $textfile = 'test/test-utf8.txt';
    print "filename: $textfile\n\n";
    while ( not scalar @paras ) {
	add_random_paras_from_text_file( $textfile );
    }
    foreach ( @paras ) {
	print $_;
    }
    
    exit(0);
}


=head2 interactive_help() and display_usage()

Give help on the command line or while running.

=cut

sub interactive_help {
    print STDERR<<HELP;

== Help ==

h,?		Get this message
+		Speed up display
-		Slow down display
f		Turn on/off printing of filenames and line numbers
w		Turn on/off wrapping of text to fit terminal window
d		Turn on/off debug mode
[spacebar]	Pause (any key resumes)
q, x, ESC	Quit

Press a key to continue

HELP
#,
    ReadKey( 0 );
}

sub display_usage {
    # strip the path from the filename by which we were called
    my $scriptname = $0;
    $scriptname =~ s/.*\///;

    print STDERR <<USAGE;

Textual slideshow -- by Jim Henry III, http://jimhenry.conlang.org/software/

Usage: $scriptname <options> <filenames and/or directory names>

Options:

-s --sleep=number            Sleep time (seconds to sleep for each
                             character output, default is $pause_len seconds per character)
-f --print-filenames         Print filename and line number of each paragraph
-w --wrap                    Re-wrap paragraphs to fit the terminal window
-W --force-wrap              Re-wrap everything, even if it looks like source code,
                             poetry etc.
-m --margin=[number]         Margin (integer); overrides terminal window width for
                             rewrap
-t --type                    Identify files with Perl -T filetest instead of a file 
                             extension
-e --extensions              Look for specified list of extensions (comma-separated)
                             instead of just .txt
--weights=filename           Specify a file containing regular expression/weight pairs
                             to determine how likely filenames matching various regexes
                             are to be used.  See the manual for details.
-l  --min-length=[number]    Don't print paragraphs shorter than this (in characters).
-L  --max-length=[number]    Don't print paragraphs longer than this (in characters).
-p --max-paragraphs          Maximum number of random paragraphs to store in memory
                             at once
--preload                    Start loading and printing paragraphs before we finish
                             collecting the list of filenames (use this only if 
                             working on a huge dir hierarchy and the long startup 
                             time is annoying)
-h --help                    Print this brief help message.
-M --manual                  Show the user manual.

While the slideshow is running, press h for help, q for quit.

USAGE

    exit(0);
}


=head2 validate_options()

Sanity-check the variables we were given on the command line.

=cut

sub validate_options {

    my $errs = 0;
    if ( $min_length !~ m/^[0-9]+$/ ) {
	print "-l --min-length value must be an integer\n";
	$errs++;
    }

    if ( $max_length != NO_MAX and $max_length !~ m/^[0-9]+$/ ) {
	print "-L --max-length value must be an integer\n";
	$errs++;
    }

    if ( $rewrap_margin !~ m/^[0-9]+$/ ) {
	print "-m --margin value must be an integer\n";
	$errs++;
    }
    if ( $max_paras !~ m/^[0-9]+$/ ) {
	print "-p --max-paragraphs value must be an integer\n";
	$errs++;
    }

    if ( $pause_len < 0 or $pause_len !~ m/^ [0-9]* \. [0-9]+ $ 
				   	 | ^ [0-9]+ (?: \. [0-9]+ )? $ /x ) 
    {
	print "-s --sleep value must be a non-negative floating point number\n";
	$errs++;
    }

    if ( $check_extensions ) {
	@file_extensions = split ',', $check_extensions;
    }

    if ( $debug ) {
        print "arg sleep 		 \$pause_len  $pause_len\n";
        print "arg print-filenames 	 \$print_filenames  $print_filenames\n";
        print "arg wrap 		 \$wrapping  $wrapping\n";
        print "arg force-wrap 	         \$force_wrap  $force_wrap\n";
        print "arg margin 		 \$rewrap_margin  $rewrap_margin\n";
        print "arg type 		 \$check_type  $check_type\n";
        print "arg extensions 	 	 \$check_extensions  " . ( defined $check_extensions ? $check_extensions : '' ) . "\n";
        print "arg weights 		 \$weights_file  $weights_file\n";
        print "arg max-paragraphs 	 \$max_paras  $max_paras\n";
        print "arg min-length 	 	 \$min_length  $min_length\n";
        print "arg max-length 	 	 \$max_length  $max_length\n";
        print "arg preload 		 \$preload_paras  $preload_paras\n";
        print "arg help 		 \$help  $help\n";
        print "arg manual 		 \$manual  $manual\n";
        print "arg debug 		 \$debug  $debug\n";
        print "arg startup-test 	 \$startup_test  $startup_test\n";
        print "arg utf8-test 	 	 \$utf8_test  $utf8_test\n";
    }
    
    exit(1) if $errs;
}



=head2 main_event_loop()

Repeatedly call slow_print() to print randomly chosen paragraphs, intermittently
calling add_paras() and delete_oldest_paras() as needed to maintain the list of 
random paragraphs.  slow_print() will take care of checking for user keystrokes
and passing them to handle_keystroke().
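
As a rough illustration of how often the loop loads more paragraphs instead
of printing one, here is a standalone sketch (not part of the script; it
copies the range calculation used in the loop, and the C<$n_paras> values are
made up for the example):

  use strict;
  use warnings;

  use constant TARGET_INIT_PARAS => 5000;

  for my $n_paras ( 100, 5000, 20000 ) {
      my $range = $n_paras >= TARGET_INIT_PARAS
                ? ( $n_paras * 1.5 )
                : ( TARGET_INIT_PARAS * 1.5 );
      # int( rand $range ) is >= $n_paras with this probability, in which
      # case more paragraphs are loaded and nothing is printed on that pass
      my $p_load = ( $range - $n_paras ) / $range;
      printf "%6d paragraphs held: load more with probability %.2f\n",
          $n_paras, $p_load;
  }

So while few paragraphs are held, most passes load more; once the target is
reached, roughly one pass in three does.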

=cut

sub main_event_loop {
    my $exhausted = 0;
    while ( 1 ) {
	my $n_paras = scalar @paras;
	my $range;
	if ( $exhausted ) {
	    $range = $n_paras;
	} else {
	    if ( files_paras_taken_from() == scalar @filenames ) {
		print "not adding any more paragraphs because we've already taken some from every file\n" if $debug;
		$exhausted = 1;
		$range = $n_paras;
	    } else {
		$range = $n_paras >= TARGET_INIT_PARAS ? ( $n_paras * 1.5 ) : (TARGET_INIT_PARAS * 1.5);
	    }
	}
	my $target = int rand $range;
	print "\$target == $target\n"	if $debug;
	if ( $target >= $n_paras ) {
	    &add_paras;
	    &avg_paras 	if $debug;
	    if ( (scalar @paras) > $max_paras ) {
		&delete_oldest_paras;
	    }
	    next;
	} else {
	    &slow_print( $target );
	}
    }
}


=head2 build_file_list()

Iterate over the filenames and/or directory names given on the command line and
build a list of filenames matching the criteria given via command line options
(--extensions and --type).

=cut

sub build_file_list {
    # if no path or file names found on command line, search $HOME
    # (or the current directory if $HOME is not set)
    if ( 0 == scalar @ARGV ) {
	my $default = $ENV{HOME} ? $ENV{HOME} : ".";
	print "no filenames or dir names on command line, so defaulting to $default\n\n"; # 	if $debug;
	&recurse_dir( $default );
	if ( 0 == scalar @filenames ) {
	    die "no text files found in $default\n";
	}
    }

    while ( defined( my $path = shift @ARGV ) ) {
	if ( -d $path ) {
	    &recurse_dir( $path );

	    # this is faster but less flexible:

	    #my $textfiles_list = `find $path -name \\*.txt -print0`;
	    #my @textfiles = split /\0/,  $textfiles_list;
	    #push @filenames, @textfiles;
	} elsif ( -f $path ) {
	    push @filenames, $path;
	} else {
	    print STDERR "$path is not a directory or a file\n";
	}
    }

    if ( not @filenames ) {
	die "No suitable files found\n";
    }

    print "got " . scalar @filenames . " filenames in " . ( time - $starttime )  . " seconds, ready to do slideshow\n\n"	if $debug or $startup_test;

}


=head2 apply_weights()

If the --weights command line option was given, read the weights file and apply
the weights to the list of filenames.

=cut

sub apply_weights {
    if ( defined $weights_file ) {
	if ( -T $weights_file ) {
	    $weights = weights_from_file( $weights_file );
	} else {
	    die qq("$weights_file" does not exist or is not a text file\n);
	}
    }
    
    print STDERR "fixing to make weighted list, \@filenames now has " . ( scalar @filenames ) . " items\n"  	if $debug;
    WeightRandomList::set_debug( $debug );
    @filenames = @{ make_weighted_list( $weights, \@filenames ) };
    print STDERR "made weighted list, \@filenames now has " . ( scalar @filenames ) . " items\n" 	if $debug;
}


=head2 want_file( filename )

Check if we want this file based on the --type and --extensions command line 
options, and if neither option was given, check if it has a .txt extension.

=cut

# comments refer to benchmark tests using ~/Documents/ and ~/etext/ dirs
sub want_file {
    my $filename = shift;
    if ( $check_type && -T $filename ) {
	# 15061 filenames in 0.692 sec
	return 1;
    } elsif ( $check_extensions ) {
	# 8857 filenames in 0.794 sec with 
	# --extensions=txt,pl,html,htm 
	if ( ( grep { $filename =~ m(\.$_$) } @file_extensions ) && -e $filename) {
	    return 1;
	}
    } elsif ( not $check_type ) {
	# neither --type nor --extensions was given, so fall back to .txt.
	# this test finds 5066 files in ~/Documents and ~/etext 
	#  in 0.218 sec
	if ( $filename =~ m(\.txt$) &&  -e $filename  ) {
	    return 1;
	}
    }
    return 0;
}


=head2 recurse_dir( directory name )

Iterate over the directory we're given, call ourselves recursively if it
contains subdirectories, and add regular files to the C<@filenames> list if they're
wanted.  Periodically print a random paragraph from files collected so far if
the --preload command line option was given.

=cut

sub recurse_dir {
    my $dirname = shift;
    $dirname =~ s!/$!!;  # strip trailing slash if any; we will add slash later when needed

    my $dirh;
    unless ( opendir $dirh, $dirname ) {
	print STDERR "can't open directory $dirname: $!\n";
	return;
    }

    my $fname;
    while ( $fname = readdir $dirh ) {
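	my $name = $dirname . "/" . $fname;
	if ( -l $name ) {
	    # symbolic links are ignored (see BUGS and LIMITATIONS); this also
	    # keeps us from recursing forever on circular links
	    print "skipping symlink $name\n"	if $debug >= 3;
	} elsif ( -d $name ) {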
	    # don't recurse on . or .. or dotfiles generally
	    if ( $fname !~ /^\./ ) {
		print "$name is a dir\n"	if $debug >= 3;
		&recurse_dir( $name );
	    }
	} elsif ( &want_file( $name ) ) {
	    print "$name is a text file\n"	if $debug >= 3;
	    push @filenames, $name;
	} else {
	    print "skipping $name\n"	if $debug >= 3;
	}
	if ( $preload_paras and ((rand 100) < 1) ) {
	    print "preload mode so printing something while still gathering filenames (" . (scalar @filenames) . " read so far)\n"	if $debug;
	    if ( scalar @filenames ) {
		&add_paras;
		&slow_print( int rand scalar @paras );
	    } else {
		print "...but there are no usable files yet\n" 	if $debug;
	    }
	}
    }
    closedir $dirh;
}


=head2 handle_keystroke( key )

Take appropriate action on keys pressed by the user.

If I were writing this now, or if the actions per keystroke were more
than two or three lines each, I'd probably use a hash of keystrokes mapped to
function references, but it's not that long and is a low priority for
refactoring.
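
A minimal standalone sketch of that hash-based approach, for illustration
only (the C<%key_actions> table and C<dispatch_key()> are hypothetical names
and are not used anywhere in this script):

  use strict;
  use warnings;

  my $pause_len = 0.045;

  # Map each key to a code reference that performs its action.
  my %key_actions = (
      '+' => sub { $pause_len /= 1.25 },
      '-' => sub { $pause_len *= 1.25 },
      'q' => sub { exit(0) },
  );

  # Look up the key (case-insensitively) and run its action, if it has one.
  sub dispatch_key {
      my $key    = shift;
      my $action = $key_actions{ lc $key };
      return 0   unless $action;    # unknown keys are ignored
      $action->();
      return 1;
  }

  dispatch_key( '+' );
  print "pause per character is now $pause_len sec\n";

Adding a new key would then only mean adding one entry to the table.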

=cut

sub handle_keystroke {
    my $key = shift;
    if ( $key eq "w" or $key eq "W" ) {
	$wrapping = not $wrapping;
	print "turned wrapping mode " . ( $wrapping ? "on\n" : "off\n" ) 	if $debug;
    }

    if ( $key eq "f" or $key eq "F" ) {
	$print_filenames = not $print_filenames;
	print "turned " . ( $print_filenames ? "on" : "off" ) 
	    . " printing of filenames and line numbers\n" 	if $debug;
    }

    if ( $key eq "d" or $key eq "D" ) {
	$debug = $debug ? 0 : 1;   # keep it numeric for the "$debug >= 3" tests elsewhere
	print "turned debug mode " . ( $debug ? "on\n" : "off\n" );
    }

    if ( $key eq "+" ) {
	$pause_len /= PAUSE_INCR_RATIO;
	print "decreased pause per char to $pause_len sec\n" if $debug;
    }

    if ( $key eq "-" ) {
	$pause_len *= PAUSE_INCR_RATIO;
	print "increased pause per char to $pause_len sec\n" if $debug;
    }

    # pause on spacebar
    if ( $key eq " " ) {
	$key = ReadKey( 0 );
    }

    if ( $key eq "h" or $key eq "H" or $key eq "?" ) {
	&interactive_help;
    }

    if ( $key eq "\e" or $key =~ m/[qQxX]/ ) {  #TODO document x in interactive help
	print "quitting\n" if $debug;
	exit(0);
    }
}


=head2 slow_print( index )

Takes a subscript to the @paras array, gets the paragraph, wraps it if needed,
and prints it slowly, one line at a time, checking for keystrokes between them.

=cut

sub slow_print {
    my $subscript = shift;
    my $para = $paras[ $subscript ] . "\n\n";

    # preferably refrain from putting too-short
    # paragraphs in the list to begin with.
    
    # This check is disabled because it was getting off-by-about-five errors.  With
    # the extra newlines at the end, a one-line paragraph should be three characters
    # longer than the visible characters in the line (the comparison needs an
    # adjustment for that), but that doesn't account for being off by five.
#    return if length($para) < $min_length;
    
    if ( $wrapping ) {
	print "fixing to wrap $indices[ $subscript ]\n" 	if ( $debug );
	$para = &rewrap( $para );
    }

    # include the filename/line number index in the paragraph instead
    # of printing it separately, so we don't have to duplicate the
    # sleeping/keystroke-handling logic below.
    if ( $print_filenames or $debug ) {
	$para = $indices[ $subscript ] . $para;
    }

    my @lines = split /\n/, $para;
    # adding a trailing newline has to be done here, not where we collect the
    # paragraphs, because wrap() strips trailing whitespace
    push @lines, "\n";   
    foreach ( @lines ) {
	{ # for trapping "Wide character in print" errors
	    use warnings FATAL => 'all';
	    eval { print $_ . "\n" };
	    if ( $@ ) {
		print "print failed with error $@\n";
		print_as_ascii( $_ );
	    }
	}

	# Wait in proportion to the length of the line just printed.  If the
	# user presses a key before the pause time for the current line has
	# passed, we handle it and then wait out whatever is left of the pause
	# instead of skipping straight to the next line.
	my $start = time;
	my $remaining_wait =  $pause_len * length $_;
	while ( $remaining_wait > 0 ) {
	    my $key = ReadKey( $remaining_wait );
	    if ( defined $key ) {
		&handle_keystroke( $key );
	    }
	    # the $pause_len might have been changed by user's keystroke
	    $remaining_wait = ($pause_len * length $_) - (time - $start);
	}
    }
#    print "\n\n";
}


=head2 print_as_ascii( string )

If trying to print something gets an error (e.g. "Wide character in
print"), we call this to convert the high-bit characters to hex numbers and
print the line/paragraph as ASCII.  That shouldn't happen anymore after the
recent fixes (2023/9/7).

=cut

sub print_as_ascii {
    my $line = shift;
    my @c = split '', $line;
    foreach my $c ( @c ) {
	if ( ord $c < 0x80 ) {
	    print $c;
	} else {
	    printf "%%%02x", ord $c;
	}
    }
    print "\n";
}


=head2 HTML parsing functions

These functions are grouped in a scope block because they share some
state variables.  Basically it's one function to initialize our HTML::Parser
object for a given file, parse, and return the list of paragraphs found; and
four callback functions that handle start tags, end tags, text blocks and the
end of the document.  We build up
a paragraph with each text block, and when we hit an opening or closing tag of
certain types, we decode the paragraph as needed and add the working paragraph
to an array.

=cut

{ # scope block for html parsing functions with shared variables
    my $working_paragraph = '';
    my @paras_found;
    my $utf8_parser = 0;		# whether we are running HTML::Parser in utf8 mode
    my $open_emphasis = 0;		# whether there is an open <em> <strong> etc tag

    sub paragraphs_from_html_file {
	my $filename = shift;

	@paras_found = ();  # empty out what's left over from last time we were called
	my $p = HTML::Parser->new(
	    api_version         => 3,
	    start_h             => [ \&start_tag_handler, 'tagname, attr'],
	    end_h               => [ \&end_tag_handler,   'tagname'],
	    text_h 	        => [ \&text_handler,  'dtext'],
	    end_document_h      => [ \&end_of_html_file_handler ],
	    marked_sections     => 1,
	    );
	$utf8_parser = is_utf8( $filename );
	print "$filename: setting parser's utf8_mode( $utf8_parser )\n"  if $debug;
	$p->utf8_mode( $utf8_parser );

	$p->parse_file( $filename );
	return @paras_found;
    }
    
    sub start_tag_handler {
	my $tag = shift;
	if ( $tag =~ m/^(p|li|td|h[1-6])$/ ) {
	    add_paragraph_from_html();
	} elsif ( $tag =~ m/^(i|em|cite)$/ ) {
	    $working_paragraph .= color( $em_color ); # "\e[96;4m";
	    $open_emphasis = 1;
	} elsif ( $tag =~ m/^(b|strong)$/ ) {
	    $working_paragraph .= color( $strong_color ); # "\e[93;1m";
	    $open_emphasis = 1;
 	} elsif ( $tag eq 'body' && $working_paragraph ) {
	    # discard any non-title text from the <head> element -- most likely an
	    # embedded stylesheet or Javascript
	    $working_paragraph = '';
	}
    };

    sub end_tag_handler {
	my $tag = shift;
	if ( $tag =~ m/^(title|p|li|td|h[1-6]|body|html)$/ ) {
	    add_paragraph_from_html();
	} elsif ( $tag =~ m/^(b|i|em|strong|cite)$/ ) {
	    $working_paragraph .= color('reset'); #"\e[0m";
	    $open_emphasis = 0;
	} elsif ( $tag =~ m/^(style|script)$/ ) {
	    # discard any CSS or JavaScript found in body
	    $working_paragraph = '';
	}
    };

    sub text_handler {
	$working_paragraph .= shift;
    };

    sub end_of_html_file_handler {
	# take care of situations where the last paragraph in the file has no close
	# tags of any kind (p, body, html)
	add_paragraph_from_html();
    }
  
    sub add_paragraph_from_html {
	# Strip leading and trailing whitespace, collapse whitespace.
	$working_paragraph =~ s/^\s+//;
	$working_paragraph =~ s/\s+$//;
	$working_paragraph =~ s/\s+/ /g;

	return unless $working_paragraph;

	if ( $open_emphasis ) {
	    # the paragraph has an open emphasis tag of some kind but no matching close tag.
	    # our ANSI color coding will spill over to other paragraphs if we don't close it.
	    $working_paragraph .= "\e[0m";
	    $open_emphasis = 0;
	}	    

	# Apparently if HTML::Parser parser is running in utf8 mode, it produces output that
	# needs to be decoded as utf8, whereas if it's running in ASCII mode, its output is already
	# utf8?  That seems counterintuitive.  I tested this with two fairly similar HTML files,
	# both with several high-bit characters represented by HTML entities (e.g. &ldquo;) and
	# one that also had a literal e-acute character so the file would be read as UTF8.  This
	# code works on both, finally, after hours of hair-tearing frustration.
	if ( $utf8_parser and $working_paragraph =~ m/[\x100-\xFFFF]/ ) {
	    eval { $working_paragraph = Encode::decode('UTF-8', $working_paragraph, Encode::FB_WARN); };
	    if ( $@ ) {
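		# chomp returns the number of characters removed, not the string,
		# so copy $@ first and chomp the copy
		chomp( my $eval_err = $@ );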
		print qq(Encode::decode() croaked: "$eval_err"\n)	if $debug;
		$working_paragraph = '';
		return;
	    }
	}
	print "adding to \@paras_found:" . $working_paragraph . "\n"   if $debug >= 2;
	push @paras_found, $working_paragraph;
	$working_paragraph = '' ;
    };
} # end scope block for HTML parsing functions


=head2 add_paras()

Pick a random file, snag a random subset of paragraphs from it.  

If/when we add epub support, this if/else would get another branch and we'll
write an add_random_paras_from_epub_file() function.

=cut

sub add_paras {
    my $filename = $filenames[ int rand scalar @filenames ];
    if ( $filename =~ m/$html_extensions/ ) {
	add_random_paras_from_html_file( $filename );
    } else { #  some other kind of plain text file
	add_random_paras_from_text_file( $filename);
    }
}


=head2 add_random_paras_from_html_file( filename )

Get a list of the parsed paragraphs from an HTML file, then
randomly pick a subset of them to add to the C<@paras> array.

Note that we can't save the line numbers to the C<@indices> array
because HTML::Parser doesn't give our callback functions access
to line numbers in the source file, as far as I can tell.
I might be able to work around that by parsing the file in chunks
rather than in one call to parse_file, but I'm not sure it's
worthwhile compared to other work I want to get done (like adding
.epub support).
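
The random subset is chosen by keeping a paragraph whenever
C<< rand $prange <= 1 >> is true, where C<$prange> is the file size divided
by C<SAVE_RATE>.  The following is only a rough sketch of that arithmetic:
the C<SAVE_RATE> value, the 500-byte average paragraph size and the
C<keep_probability()> helper are all assumptions made for illustration.

```perl

    use strict;
    use warnings;

    use constant SAVE_RATE => 2000;    # assumed value, for illustration only

    # Probability that "rand($size / SAVE_RATE) <= 1" is true for one paragraph.
    # rand(N) is uniform on [0, N), so for N > 1 the chance is 1/N.
    sub keep_probability {
        my ($size) = @_;
        my $prange = $size / SAVE_RATE;
        return $prange <= 1 ? 1 : 1 / $prange;
    }

    for my $size ( 1_000, 100_000, 1_000_000 ) {
        my $paragraphs = $size / 500;    # assume 500 bytes per paragraph
        printf "%8d bytes: expect about %.1f paragraphs kept\n",
            $size, $paragraphs * keep_probability($size);
    }

```

Under these assumptions the expected number of paragraphs kept stops growing
once a file is larger than C<SAVE_RATE> bytes, which is why long and short
files contribute roughly the same amount.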

=cut

sub add_random_paras_from_html_file {
    my $fn = shift;
    my @all_paras = paragraphs_from_html_file( $fn );
    my $size = -s $fn;
    my $prange = $size / SAVE_RATE;  ### TODO may want to work in a fudge factor for HTML vs text - what proportion is tags?
    my $number_saved = 0;
    
    for my $p ( @all_paras ) {
	if ( rand $prange <= 1 && good_length($p) ) {
	    push @paras, rewrap( $p );
	    push @indices, "$fn\n\n";
	    $number_saved++;
	}
    }
    log_paras_got_count( $number_saved, $fn, $size );
}


=head2 is_utf8( filename )

Check whether a file is encoded as utf8.  To be used by paragraphs_from_html_file()
so it can tell the parser object whether the file it's working on is utf8.

If I support epubs at some point, I'll want to revise this so it can take its
argument as a string representing the contents of an HTML file as well as a filename.
Maybe use a hash with different keys representing different types of argument?

=cut

sub is_utf8 {
    my $filename = shift;
    my $fh;
    # three-argument open so that odd characters in a filename can't change the mode
    if ( not open $fh, '<', $filename ) {
	print STDERR qq(can't open "$filename" for reading: $!\n);
	if ( ++$err_count >= $max_errs ) {
	    die "too many file open errors\n";
	}
	return;
    }
    my @lines = <$fh>;
    close $fh;
    my $decoder = get_decoder( \@lines );
    if ( $decoder ) {
	print "is_utf8 found decoder name ". $decoder->name . "\n"  if $debug;
	return $decoder->name eq 'utf8' ? 1 : 0;
    } else {
	return 0;
    }
}


=head2 good_length( paragraph )

Check a paragraph's length against the --min-length and --max-length options.
Return true if within range, false if not.

=cut

sub good_length {
    my $para = shift;
    chomp $para;
    my $len = length $para;
    if ( $len < $min_length ) {
	return 0;
    }
    if ( $max_length != NO_MAX  and  $len > $max_length ) {
	return 0;
    }
    return 1;
}



=head2 project_gutenberg_license_header( line ) and project_gutenberg_license_footer( line )

Check whether the current line looks like the beginning or end of a Project Gutenberg
ebook's license agreement section.  Return true if it does, false if it doesn't.

=cut

sub project_gutenberg_license_header {
    my $line = shift;
    return ( $line =~ m/(the\s+)*Project\s+Gutenberg.*e-?(text|book)/i
	     or $line =~ m/START.*SMALL\s+PRINT/i
	     or $line =~ m/END OF.*PROJECT\s+GUTENBERG E-?(TEXT|BOOK)/
	     or $line =~ m/START.*FULL LICENSE/ );
}

sub project_gutenberg_license_footer {
    my $line = shift;
    return ( $line =~  m/END.*SMALL\s+PRINT/
	     or $line =~ m/START OF THE PROJECT\s+GUTENBERG.*E-?(BOOK|TEXT)/
	     or $line =~ m/END.*FULL LICENSE/ );
}


=head2 add_random_paras_from_text_file( filename )

Read all the lines from a file into an array, figure out the encoding, then
iterate over the lines and build paragraphs from the non-blank lines between
blanks.  Randomly pick a subset of the paragraphs to add to C<@paras>.
Filter out Project Gutenberg license.

It's overly long even after a lot of refactoring and I may refactor
further by splitting it into a function to parse the file into paragraphs
and another one to pick a random subset of those to save, like I did with
the HTML parsing functions.
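
As a minimal sketch of what the parsing half of such a split might look like
(not the code used below), the hypothetical helper C<paragraphs_from_lines()>
groups already-decoded lines into paragraphs at blank lines and remembers
where each paragraph started.  The name and the return format are
assumptions.

```perl

    use strict;
    use warnings;

    # Hypothetical helper: group non-blank lines into paragraphs, using blank
    # lines as separators.  Returns a list of [ starting line number, text ].
    sub paragraphs_from_lines {
        my @paras;
        my ( $start, $text ) = ( 0, '' );
        my $linenum = 0;
        for my $line ( @_ ) {
            $linenum++;
            if ( $line =~ m/^\s*$/ ) {    # blank line ends a paragraph
                push @paras, [ $start, $text ] if length $text;
                $text = '';
            } else {
                $start = $linenum unless length $text;
                $text .= $line;
            }
        }
        # last paragraph, if the input doesn't end with a blank line
        push @paras, [ $start, $text ] if length $text;
        return @paras;
    }

    my @found = paragraphs_from_lines( "one\n", "two\n", "\n", "\n", "three\n" );
    printf "line %d: %s", @$_ for @found;

```

The random selection and the Project Gutenberg filtering could then work on
the returned list instead of being interleaved with the line loop.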

=cut

sub add_random_paras_from_text_file {
    my $filename = shift;
    my $fh;
    # three-argument open so that odd characters in a filename can't change the mode
    if ( not open $fh, '<', $filename ) {
	print STDERR qq(can't open "$filename" for reading: $!\n);
	if ( ++$err_count >= $max_errs ) {
	    die "too many file open errors\n";
	}
	return;
    }

    # first read in raw mode, then figure out encoding later
    binmode( $fh, ':raw' );

    my $size = -s $fh;
    if ( $size == 0 ) {
	return;
    }

    print "getting paragraphs from $filename\n"	if $debug;

    # we can't know how to decode the file until we examine it.
    # slurp the file first:
    my @lines = <$fh>;
    return unless @lines;
    my $decoder = get_decoder( \@lines );
    return unless $decoder;

    # we want the number of paragraphs taken from each file to be roughly
    #  the same, whether it's a short one or a long one.
    my $prange = $size / SAVE_RATE;

    my $paras_added = 0;
    my $saved_indices = 0;
    my $in_para = 0;
    my $saving = 0;
    my $current_para = "";
    my $legalese = 0;
    my $index = '';

    for ( my $i = 0; $i < scalar @lines; $i++ ) {
	my $linenum = $i+1;
	my $line = decode_line( $decoder, $lines[$i], $filename, $linenum ) ;

	$line =~ s/\r//g;		# leave newlines alone but strip CR
	if ( $line !~ m/^\s*$/ ) {	# non-blank
	    if ( project_gutenberg_license_header($line) ) {
		$legalese = 1;
		next;
	    }

	    if ( $in_para == 0 ) {
		if ( (not $legalese) and (rand $prange <= 1) ) {
		    #push @indices, "lines $linenum+ of $filename\n\n";
		    $index = "lines $linenum+ of $filename\n\n";
		    $saving = 1;
		}
	    }
	    $in_para = 1;

	    if ( project_gutenberg_license_footer($line) ) {
		$legalese = 0;
	    }
	} else {		# blank line signaling end of paragraph
	    if ( $saving and good_length($current_para) ) {
		push @paras, $current_para . "\n\n";
		push @indices, $index;
		$paras_added++;		# I'm not sure I need these counts anymore
		$saved_indices++;	# now that I'm pushing both onto the lists in one place.
	    }
	    $current_para = '';
	    $index = '';
	    $in_para = 0;
	    $saving = 0;
	}

	if ( $saving ) {
	    $current_para .= $line;
	}
    } # end for each line

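    # handle the last paragraph of the file if there are no blank lines at the end;
    # apply the same length check as for paragraphs ended by a blank line
    if ( $saving and good_length($current_para) ) {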
	$current_para .= "\n\n";
	push @paras, $current_para;
	push @indices, $index;
	$paras_added++;
	$saved_indices++;
    }
    close $fh;
    log_paras_got_count( $paras_added, $filename, $size );

    # this assertion is needed because we have to push items onto @paras and @indices
    # at different points in the code ( @indices at the beginning of a paragraph and
    # @paras at the end).  It used to fail with the old kludgy HTML parsing code if
    # we encountered a paragraph containing only an empty tag like <hr>.
    if ( $paras_added != $saved_indices ) {
	print "file $filename charset " 
	    . ( defined $decoder
		? ( ref $decoder ? $decoder->name : $decoder )
		: "(undefined)" ) 
	    . "\n\$saved_indices = $saved_indices,  \$paras_added = $paras_added\n";
	die "assertion failed: $paras_added must be equal to $saved_indices\n";
    }
} # end add_random_paras_from_text_file()


=head2 delete_oldest_paras()

Delete the oldest ten percent of saved paragraphs.

=cut

sub delete_oldest_paras {
    my $delete_count = int 0.1 * (scalar @paras);
    print "fixing to remove the oldest $delete_count paragraphs\n" 	if $debug;
    while ( $delete_count-- ) {
	shift @paras;
	shift @indices;
    }
}


=head2 log_paras_got_count()

Add the number of paragraphs saved from a file to a hash keyed on filename,
which is used by avg_paras() and files_paras_taken_from().

=cut

{ #scope block
    my %paras_per_file;
    sub log_paras_got_count {
	my ($number_of_paras_saved, $fn, $size) = @_;
    #    $paras_per_file{$fn} //= 0;
	$paras_per_file{$fn} += $number_of_paras_saved;
	if ( $debug ) {
	    print "added $number_of_paras_saved paragraphs and indices from $fn of $size bytes\n";
	    print "now have " . @paras . " paragraphs in memory, taken from " 
		. (scalar keys %paras_per_file) . " files \n";
	    print "\tand " . scalar @indices . " filename/line number indices\n";
	}
     }

=head2 files_paras_taken_from()

Returns the number of files we've taken paragraphs from.

=cut

    sub files_paras_taken_from {
	return scalar keys %paras_per_file;
    }

    
=head2 avg_paras()

Return average number of paragraphs taken from each file.

=cut

    sub avg_paras {
	my $total = 0;
	foreach ( keys %paras_per_file ) {
	    $total += $paras_per_file{$_};
	}
	my $n_files = scalar keys %paras_per_file;
	if ( not $n_files ) {
	    # nothing read yet; also avoids dividing by zero below
	    print "no paragraphs read yet\n"	if $debug;
	    return 0;
	}
	my $avg = $total / $n_files;
	printf "average number of paragraphs taken from each file: %.3f\n", $avg	if $debug;
	return $avg;
    } 
} # end scope block


=head2 rewrap( paragraph )

Wrapper (heh) for Text::Wrap::wrap().  Do some tests to see if we need to wrap the current paragraph
and what the margin should be, then wrap it.
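
The "should we wrap" tests can be sketched on their own as follows.  This is
an illustration rather than the code used below: the helper name
C<looks_preformatted()> and the sample text are assumptions, while the
40-character and three-space thresholds are the ones rewrap() uses.

```perl

    use strict;
    use warnings;
    use Text::Wrap qw(wrap);

    # Hypothetical helper: true if the paragraph looks like poetry, code or
    # other formatted text that should be left alone.
    sub looks_preformatted {
        my ($para) = @_;
        for my $line ( split /\n/, $para ) {
            return 0 if length($line) > 40;          # a long line suggests prose
            return 1 if $line =~ m/^(?: {3}|\t)/;    # deep indent suggests verse or code
        }
        return 1;    # only short lines, so probably verse
    }

    $Text::Wrap::columns = 40;
    my $poem  = "Roses are red,\n   violets are blue\n";
    my $prose = join( ' ', ('lorem ipsum dolor') x 8 ) . "\n";
    for my $para ( $poem, $prose ) {
        my $out = looks_preformatted($para) ? $para : wrap( '', '', $para );
        print $out, "\n";
    }

```

Only the second sample is passed to wrap(); the first is printed as it is.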

=cut

sub rewrap($) {
    my $para = shift;
    $para =~ s/ +$//;
    my $margin = DEFAULT_MARGIN;

    # if we have a forced value from the command line, use it.  Otherwise,
    # get the column width of the terminal.  (user might have changed
    # terminal window size or font size since we printed the last paragraph,
    # so call GetTerminalSize() every time.)
    if ( $rewrap_margin ) {
	$margin = $rewrap_margin;
    } else {
	my ($width_chars, $height_chars, $width_pixels, $height_pixels)
	    = Term::ReadKey::GetTerminalSize STDOUT;
	
	if ( defined $width_chars ) {
	    $margin = $width_chars - 1;
	} elsif ( $debug ) {
	    print "Term::ReadKey::GetTerminalSize failed\n";
	}
    }

    if ( length($para) <= $margin ) {
	print "not wrapping because paragraph fits into one line\n" if $debug;
	return $para;
    }
    

    my @lines = split /\n/, $para;
    my $longline_found = 0;
    my $spacing = 0;
    foreach ( @lines ) {
	if ( (length $_) > 40 ) {
	    $longline_found = 1;
	    last;
	}
	if ( m/^   / || m/^\t/ ) {
	    $spacing = 1;
	    last;
	}
    }

    if ( not $force_wrap and ( $spacing or !$longline_found ) ) {
	print "paragraph looks like poetry or some other formatted text so not rewrapping\n"    if $debug;
	return $para;
    }

    print "using margin $margin\n" 	if $debug;
    $Text::Wrap::columns = $margin;

    # turn all sequences of whitespace into single space.  this is necessary
    # when the input could include mixed Unix and Windows text files, because
    # wrap() leaves \r alone while handling \n
    $para =~ s/\s+/ /g;
    # strip trailing space
    $para =~ s/ +$//;

    return (wrap( "", "", $para )) . "\n";
}



=head2 get_decoder( array ref )

Takes a reference to an array of lines from a file whose encoding we don't know
yet.  Gets a decoder object for it and then rebuilds the array with different
newlines if the encoding requires it.  Returns the decoder object.
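
A small sketch of why the re-split is needed, assuming a two-line sample
string built with Encode (the sample text and variable names are made up for
illustration).  In raw mode the file is split on the single byte C<\n>, which
leaves the newline's trailing zero byte at the front of the next chunk.

```perl

    use strict;
    use warnings;
    use Encode qw(encode decode);

    # Two lines of text encoded as UTF-16LE, without a byte order mark.
    my $bytes = encode( 'UTF-16LE', "one\ntwo\n" );

    # Splitting on the single byte "\n" strands the newline's "\0" at the
    # start of the following chunk, so that chunk has an odd length.
    my @wrong = split /\n/, $bytes;

    # Splitting on the whole two-byte little-endian newline keeps every
    # chunk aligned on a character boundary.
    my @right   = split /\n\0/, $bytes;
    my @decoded = map { decode( 'UTF-16LE', $_ ) } @right;

    printf "misaligned chunk length: %d bytes\n", length $wrong[1];
    print join( '|', @decoded ), "\n";

```

Note that splitting this way discards the newlines themselves.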

=cut

sub get_decoder {
    my $arrayref = shift;
    my $decoder = &figure_out_encoding( $arrayref );
    if ( not $decoder ) {
	# figure_out_encoding() should already have logged err msg if necessary
	return;
    }

    if ( $decoder->name =~ /UTF-(16|32)LE/ ) {
	# split done by <> was wrong in raw mode, redo it now that we know what
	# a little-endian newline really looks like
	my $whole_file = join "", @$arrayref;
	if ( $decoder->name eq "UTF-16LE" ) {
	    @$arrayref = split /\n\0/, $whole_file;
	} else {
	    @$arrayref = split /\n\0\0\0/, $whole_file;
	}
	print "rejoined and resplit because of littleendian format\n" if $debug;
    }
    return $decoder;
}

=head2 decode_line( decoder, line, filename, line number )

Decode the line if possible, print error messages if not.
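
For comparison, a sketch of a stricter variant is shown here.  It is an
assumption and not what decode_line() does: it passes C<Encode::FB_CROAK> so
that malformed input makes decode() die inside the eval, and the helper name
C<try_decode()> is hypothetical.

```perl

    use strict;
    use warnings;
    use Encode ();

    # Hypothetical helper: returns ( text, undef ) on success and
    # ( undef, error message ) on failure.
    sub try_decode {
        my ( $decoder, $bytes ) = @_;
        my $text = eval { $decoder->decode( $bytes, Encode::FB_CROAK ) };
        if ( $@ ) {
            chomp( my $err = $@ );    # copy first; chomp returns a count
            return ( undef, $err );
        }
        return ( $text, undef );
    }

    my $utf8 = Encode::find_encoding('UTF-8');
    for my $bytes ( "caf\xC3\xA9 au lait", "caf\xE9 au lait" ) {
        my ( $text, $err ) = try_decode( $utf8, $bytes );
        print defined $text ? "decoded ok\n" : "decode failed: $err\n";
    }

```

The second sample is Latin-1 rather than UTF-8, so it is reported as a
failure instead of being decoded with substitution characters.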

=cut

sub decode_line {
    my ($decoder, $line, $filename, $linenum) = @_;
    my $decoded_line = eval { $decoder->decode( $line ) };
    if ( $@ ) {
	# chomp returns the number of characters removed, not the string
	chomp( my $eval_err = $@ );
	print qq(Encode->decode() failed horribly at $filename line $linenum: "$eval_err"\n)	if $debug;
	return;
    }

    # this used to happen with LE charsets before I added the rejoin/resplit
    # code above.  keep it in case something else occasionally goes wrong
    # with decode but doesn't cause eval to fail.  (check the decoded text;
    # the raw bytes in $line always pass utf8::valid)
    if (  not utf8::valid($decoded_line) ) {
	if ( $debug ) {
	    print "have invalid utf8 after decoding at $filename line $linenum:\n";
	    print $decoded_line;
	    print "\n";
	}
	return;
    }
    
    return $decoded_line;
}



=head2 figure_out_encoding( reference to array of strings )

Use Encode::Guess plus some heuristics to figure out the
probable encoding of a file, based on an array of lines
from the file (passed by reference because it could be
very big and we don't need to modify it).  Return a decoder 
object.

"In the script above, there's a line commented out giving cp437 as one
of the defaults to initialize Encode::Guess; if I have that line in,
every file with high-bit characters gets a bad decoder. This is not
surprising, given the man page's warning that Encode::Guess is bad at
distinguishing different 8-bit encodings from each other. I have a lot
of cp437 etexts lying around, but I'm pretty sure I can write an
ad-hoc routine to distinguish them from the Latin-1 text files -- in
theory both code pages use all the characters from 0x80 to 0xFF, but
in practice, only accented Latin letters characters in the 80 to A5
range are common in cp437 text files and only characters in the C0 to
FF range are common in Latin-1 files."

L<https://www.perlmonks.org/?node_id=870638>
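
The cp437 versus Latin-1 heuristic described in that quote can be sketched
like this.  The helper name C<guess_8bit_encoding()> and the sample strings
are assumptions for illustration; the byte ranges are the ones named above.

```perl

    use strict;
    use warnings;

    # Hypothetical helper: guess between cp437 and Latin-1 by counting bytes
    # in the ranges where each code page keeps its common accented letters.
    sub guess_8bit_encoding {
        my ($bytes) = @_;
        my $cp437_letters  = () = $bytes =~ m/[\x80-\xA5]/g;
        my $latin1_letters = () = $bytes =~ m/[\xC0-\xFF]/g;
        return $cp437_letters > $latin1_letters ? 'cp437' : 'iso-8859-1';
    }

    print guess_8bit_encoding("caf\x82"), "\n";    # e-acute is 0x82 in cp437
    print guess_8bit_encoding("caf\xE9"), "\n";    # e-acute is 0xE9 in Latin-1

```

Ties go to Latin-1, which matches the "at least as many" wording of the
debug message in the function below.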

=cut

sub figure_out_encoding {
    my $arr_ref = shift;
    if ( not ref $arr_ref eq "ARRAY" or scalar @$arr_ref == 0 ) {
	die "arg to figure_out_encoding() should be ref to non-empty array";
    }

    # do an Encode::Guess on the join of all the lines
    # then if its results are ambiguous, do further heuristics

    # first save the possible byte order mark because Encode::Guess is
    # too lame to specify UTF-16BE or UTF-16LE, it just says UTF-16 so
    # the resulting decoder obj then fails if actually used on a
    # random string from the middle of the file.  ditto with UTF-32(BE|LE)
    my $first_two_bytes = substr( $$arr_ref[0], 0, 2 );
    my $first_four_bytes = substr( $$arr_ref[0], 0, 4 );

    my $content = join '', @$arr_ref;
    my $cp437_letters = 0;
    my $latin1_letters = 0;
    my $decoder = eval { Encode::Guess->guess( $content ); };

    if ( $@ ) {
	# chomp returns the number of characters removed, not the string
	chomp( my $eval_err = $@ );
	print qq($_: Encode::Guess->guess() failed horribly: "$eval_err"\n);
	return undef;
    }

    if ( ref $decoder ) {
	print "appears to be " . $decoder->name . "\n"	if $debug;

	if ( $decoder->name eq "UTF-16" ) {

	    # examine byte order mark and recreate decoder as UTF-16BE or UTF-16LE
	    if ( $first_two_bytes eq "\xFF\xFE" ) {
		# little-endian
		$decoder = find_encoding( "UTF-16LE" );
		if ( $debug ) {
		    if ( ref $decoder ) {
			print "Encode::Guess thinks the file is UTF-16 and the byte order mark is FFFE so recreated decoder as UTF-16LE\n";
		    } else {
			print "can't find encoding for 'UTF-16LE'\n";
		    }
		}
	    } elsif ( $first_two_bytes eq "\xFE\xFF" ) {
		# big-endian
		$decoder = find_encoding( "UTF-16BE" );
		if ( $debug ) {
		    if ( ref $decoder ) {
			print "Encode::Guess thinks the file is UTF-16 and the byte order mark is FEFF so recreated decoder as UTF-16BE\n";
		    } else {
			print "can't find encoding for 'UTF-16BE'\n";
		    }
		}
	    } else {
		print "Encode::Guess thinks this file is UTF-16 but the first two bytes are neither FFFE nor FEFF so aborting decode attempt\n" 	if $debug;
		return undef;
	    }
	    return ( ref $decoder ? $decoder : undef );

	} elsif ( $decoder->name eq "UTF-32" ) {

	    # examine byte order mark and recreate decoder as UTF-32BE or UTF-32LE
	    if ( $first_four_bytes eq "\xFF\xFE\x00\x00" ) {
		# little-endian
		$decoder = find_encoding( "UTF-32LE" );
		if ( $debug ) {
		    if ( ref $decoder ) {
			print "Encode::Guess thinks the file is UTF-32 and the byte order mark is FFFE0000 so recreated decoder as UTF-32LE\n";
		    } else {
			print "can't find encoding for 'UTF-32LE'\n";
		    }
		}
	    } elsif ( $first_four_bytes eq "\x00\x00\xFE\xFF" ) {
		# big-endian
		$decoder = find_encoding( "UTF-32BE" );
		if ( $debug ) {
		    if ( ref $decoder ) {
			print "Encode::Guess thinks the file is UTF-32 and the byte order mark is 0000FEFF so recreated decoder as UTF-32BE\n";
		    } else {
			print "can't find encoding for 'UTF-32BE'\n";
		    }
		}
	    } else {
		print "Encode::Guess thinks this file is UTF-32 but the first four bytes are not a valid byte order mark so aborting decode attempt\n" 	if $debug;
		return undef;
	    }
	    return ( ref $decoder ? $decoder : undef );

	} elsif ( $decoder->name eq "iso-8859-1" ) {
	    my $lines_to_check = ( scalar @$arr_ref > 1000 ) ? 1000 :  scalar @$arr_ref;
	    for ( my $i = 0;  $i < $lines_to_check;  $i++ ) {
		while ( $$arr_ref[$i] =~ m/[\x80-\xA5]/g ) {
		    $cp437_letters++;
		}
		while ( $$arr_ref[$i] =~ m/[\xC0-\xFF]/g ) {
		    $latin1_letters++;
		}
	    }
	    if ( $cp437_letters > $latin1_letters ) {
		$decoder = find_encoding( "cp437" );
		if ( $debug ) {
		    if ( ref $decoder ) {
			print "more chars in cp437 range ($cp437_letters) than in Latin-1 letters range ($latin1_letters) so prob cp437\n";
		    } else {
			print "can't find encoding for 'cp437'\n";
		    }
		}
		return ( ref $decoder ? $decoder : undef );
	    } else {
		print "at least as many chars in Latin-1 letters range ($latin1_letters) as in cp437 range ($cp437_letters) so prob Latin-1\n"	if $debug;
		return $decoder;
	    }
	}
	return $decoder;
    } else {
	if ( $debug ) {
	    print qq(bad decoder returned by Encode::Guess->guess() );
	    print ( ( defined $decoder ) ? qq("$decoder") : "(undefined)" );
	    print qq(\n);
	}
	if ( defined $decoder && $decoder =~ m/utf8/ ) {
	    my $lines_to_check = ( scalar @$arr_ref > 1000 ) ? 1000 :  scalar @$arr_ref;
	    for ( my $i = 0;  $i < $lines_to_check;  $i++ ) {
		if ( $$arr_ref[$i] =~ m/[\x00-\x7F][\x80-\xFF][\x00-\x7F]/
		     or $$arr_ref[$i] =~ m/[\x00-\x7F][\x80-\xFF]$/ 
		     or $$arr_ref[$i] =~ m/^[\x80-\xFF][\x00-\x7F]/ ) 
		{
		    # this can't be utf8 if it has a high-bit char by its lonesome
		    # with low-bit chars or begin/end of line on either side
		    $decoder = find_encoding( "iso-8859-1" );
		    if ( $debug ) {
			if ( ref $decoder ) {
			    print "looks like iso-8859-1, using that encoding\n";
			} else {
			    print "can't find encoding for 'iso-8859-1'\n";
			}
		    }
		    return ( ref $decoder ? $decoder : undef );
		}
		
	    } # end for each line of array
	    $decoder = find_encoding( "utf8" );
	    if ( $debug ) {
		if ( ref $decoder ) {
		    print "looks like utf8, using that encoding\n";
		} else {
		    print "can't find encoding for 'utf8'\n";
		}
	    }
	    return ( ref $decoder ? $decoder : undef );
	    
	} else {
	    print "bad decoder and not possible match for utf8\n"	if $debug;
	    return undef;
	}# end if possible utf8
    } # end if/else we have no good decoder ref yet
}

#=====

# comments on now fixed bug.......



	# Before I added this utf8::valid() check, I was getting this error:

	#   Malformed UTF-8 character (fatal) at
	#   /home/jim/Documents/scripts/textual-slideshow.pl line 587,
	#   <FILE> line 10.

	# at the if clause which checks $line against Project
	# Gutenberg header strings (but not on earlier regex matches on it)
	# if the file is UTF-32LE or UTF-16LE (but UTF-*BE works fine)

	# I think this might be due to the <file slurp> operator
	# breaking the file up incorrectly in :raw mode before
	# we know that the file is UTF-(16|32)

	# maybe should close the file and reopen and reread it if it's one of
	# those LE formats where newline is \x0A\x00 or \x0A\x00\x00\x00?
	# or might can just join it and re-split it now that we know its
	# encoding?



####### fixed bugs:

# handle UTF-8 correctly esp. when converting unicode character
#  entities in strip_html_tags()
# currently getting lots of 'Wide character in print' warnings
# (when calling Text::Wrap::wrap(), which calls Text::Tabs):

#Malformed UTF-8 character (unexpected end of string) in match
#position at /usr/share/perl/5.10/Text/Tabs.pm line 26, <FILE> line
#60.

# and e.g.
#utf8 "\xE2" does not map to Unicode at /home/jim/Documents/scripts/textual-slideshow.pl line 513, <FILE> line 19.

# can't figure out what line 513 has to do with it.  it's not doing
# any string or I/O operation, it's a boolean test:

# 	if ( $saving ) {

# why would that have a unicode error?
# it continues to point at that line if I add dummy arithmetic operations
# to the previous and following lines, so it's not just pointing at that 
# because it's the next statement before or after the real error.

# A test of the relevant blocks of code copied into a separate script
# which runs strip_html_tags on every line reveals that:

# - the error only occurs when we have Latin-1 or cp437 input with
#   high-bit chars, not ASCII HTML with Unicode character entities
#   which then get converted within this program;

# - it's reported as happening on the last line in the main while ( <> ) {}
#   loop, no matter what that line of code is actually doing

# By successively commenting out various lines intended to handle utf8,
# I've pinned down the use open qw(...) line as responsible for the 
# 	utf8 "\xE2" does not map to Unicode
# error.  A test with a simple console mode character map suggests
# that the "Wide character in print" was taken care of by setting
# binmode(STDOUT, ":utf8");

# I haven't figured out what's causing the
#   Malformed UTF-8 character (unexpected end of string)
# error yet.


    
