#! /usr/bin/perl

=head1 NAME

podcatcher.pl -- a highly configurable command line podcatcher 

=head1 SYNOPSIS

  # Download new episodes from the podcasts listed in podconfig.txt.
  podcatcher.pl podconfig.txt

  # Copy new episodes of the podcasts listed in podconfig.txt to the mp3 player.
  podcatcher.pl --copy podconfig.txt

  # Download new episodes and write only to the log file.
  podcatcher.pl --quiet --logfile podcatcher-log.txt podconfig.txt

  # Download new episodes, sleeping an average of 120 seconds after each
  # file downloaded.
  podcatcher.pl --sleep 120 podconfig.txt

  # Save the title and description of each episode from the RSS feed to an html
  # file.
  podcatcher.pl --description podconfig.txt

  # Run in verbose debug mode.
  podcatcher.pl --debug 2 podconfig.txt

  # Get help.
  podcatcher.pl --help


=head1 DESCRIPTION

C<podcatcher.pl> reads a configuration file consisting of a list of podcasts
and attributes for each, then, depending on the mode, either downloads new
episodes of each podcast or copies new episodes to the mp3 player and then moves
them from download directories to long-term storage directories.  The default
mode is to download new episodes; run in copy mode by setting the -c or --copy 
command line option.

Global settings can be set on the command line; most can also be set in the
configuration file, although they won't take effect until that block of the
configuration file is read.  In most cases that won't matter, but e.g. using the
C<--debug> command line option may print some startup messages that aren't
printed if you set C<debug = 1> at the beginning of the config file. 


=head1 COMMAND LINE OPTIONS

=over

=item -d --debug        

Turn on debug mode.  Followed by 1 or 2 for moderate or verbose messages.

=item -s --sleep        

Number of seconds to sleep after downloading each episode.

=item -D --description  

Turn on saving title/description of each episode to an html file.

=item -l --logfile      

Specify where to write the output log.  A %s in the filename will be replaced by
today's date.

=item -q --quiet        

Write only to the logfile, not to terminal.

=item -c --copy         

Instead of downloading new episodes, copy new files to player and move them to
storage.

=item -e --extensions   

A comma-separated list of file extensions for podcast episodes.  Defaults to
mp3,m4a,oga

=item -a --agent

What User-Agent header to use for http requests.

=item -h --help         

Get brief help.

=item -m --manual       

Get long help.

=back

=head2 Example

I run this script in a daily cron job to download each day's podcast episodes:

  #! /bin/bash
  LOG=/home/jim/Downloads/wrap-podcatcher-log-`date +%Y-%m-%d`.txt

  pushd /home/jim/Downloads

  /home/jim/Documents/scripts/crontab-stuff/podcatcher.pl \
  --description \
  --logfile /home/jim/Downloads/podcatcher-log-%s.txt \
  --sleep 60  \
  /home/jim/Documents/podconfig.txt \
  >> $LOG 2>&1

  popd

The redundant logging in the wrapper script probably isn't necessary anymore,
but it's helpful when first setting up a cron job if it doesn't work right the
first time.  If the podcatcher script won't run because of some Perl
installation issue, a missing library, or invalid command line options, you'll
see the error in the wrapper log.

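A crontab entry that runs such a wrapper once a day might look like the
following sketch.  The wrapper filename C<wrap-podcatcher.sh> and the 4:30 a.m.
schedule are hypothetical; adjust both to your own setup.

  # m   h   dom mon dow   command
  30    4   *   *   *     /home/jim/Documents/scripts/crontab-stuff/wrap-podcatcher.sh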

=head1 CONFIGURATION FILE

'#' begins a comment.  Each block of options, whether global or per-podcast,
should be separated from the others by at least one blank line.  Each option
consists of a name, equals sign, and value; you can have optional whitespace
at the beginnings of lines or around the equal sign.

If a block begins with the comment C<# TEMPLATE> then that block will not be
parsed and no error messages will be issued for it.  That way we can have a
template block with empty/default variable values that we can copy, paste, and
edit when we want to add a podcast to the config file.
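
As an illustration of these rules, a small config file might be laid out like
the sketch below.  The podcast name, URL, and paths are placeholders rather
than real values.

  # Global options block.
  sleep = 60
  description = 1

  # TEMPLATE
  name =
  url =
  download_dir =
  player_dir =
  keep_dir =

  # A podcast block; whitespace around the equals sign is optional.
  name = Example Podcast
  url=https://example.com/feed.xml
  download_dir = /home/user/Downloads/example
  player_dir   = /media/user/CLIP/Podcasts/
  keep_dir     = /home/user/podcasts/example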

=head2 GLOBAL VARIABLES

These options should typically come at the beginning of the file, although you
can turn them on for a particular podcast or group of podcasts and turn them off
again for the rest of the file if you want to troubleshoot a podcast that's not
working right (especially debug, logging, and description).

=over

=item extensions

A comma-separated list of file extensions you want to download, defaulting to
mp3,m4a,oga 

=item sleep

Number of seconds to sleep (on average, will vary randomly) after each download.
Default is 30 seconds.

=item description

If set to a nonzero value, will save the title and description of each episode from
the RSS feed to an html file.

=item agent

What User-Agent header to use for http requests.

=item logfile

Set to the filename where you want log messages written, or an empty value to
stop logging.  Probably best to give this a full path, not using ~ as a synonym
for your home directory.

If the logfile name contains a %s, that will be replaced with today's date.

=item quiet

Set to nonzero if you want output only written to the log file, not to the
console.

=item debug

Set to 1 for moderately chatty debug messages and 2 for verbose debug messages.

=back

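For example, to get verbose messages for just one misbehaving podcast, the
global variables could be arranged as in this sketch.  The podcast name, URL,
and paths are placeholders.

  # Global block at the top of the file.
  sleep = 30
  logfile = /home/user/Downloads/podcatcher-log-%s.txt

  # Turn on verbose messages just before the podcast being investigated.
  debug = 2

  name = Troublesome Podcast
  url = https://example.com/troublesome/feed.xml
  download_dir = /home/user/Downloads/troublesome
  player_dir = /media/user/CLIP/Podcasts/
  keep_dir = /home/user/podcasts/troublesome

  # Back to normal for the rest of the file.
  debug = 0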

=head2 PER-PODCAST VARIABLES

An example podcast configuration block:

  name = Be the Serpent  # a podcast of extremely deep literary merit
  url=https://feed.podbean.com/betheserpent/feed.xml
  download_dir=/home/user/Downloads/betheserpent
  limit=1/14            # download one episode every 14 days
  replace=s/^/BtS__/
  player_dir=/media/user/CLIP/Podcasts/
  keep_dir = /home/user/podcasts/betheserpent
  pause=0

=head3 REQUIRED VARIABLES

=over

=item name

A freeform name for the podcast, used only in messages to the user.

=item url

The URL for the podcast's RSS feed.

=item download_dir

The directory to which to download the podcast.  I<This must not be the same
as the download directory for any other podcast.>  I recommend you use
subdirectories under your C<$HOME/Downloads> directory, one for each podcast.

If the directory doesn't exist yet the first time we download episodes for a 
given podcast, it will be created.

=item player_dir

The directory on the mp3 player (or phone) where new episodes of this podcast
are to be copied.  This I<can> be the same for multiple podcasts, or all
podcasts.

podcatcher.pl will not attempt to access this directory or C<keep_dir> unless
it's in copy mode, so you can download new episodes without having your mp3
player/phone mounted.

If the directory doesn't exist yet the first time we copy/move episodes for a 
given podcast, it will be created.

=item keep_dir

The directory where new episodes are to be moved to after they have been 
successfully copied to the mp3 player/phone.  Again, this can be the same
for multiple podcasts, but I recommend having one for each, possibly on an
external hard drive.  It must I<not> be the same as the download directory
for a given podcast, otherwise we'll keep copying the same episodes to
the player every time we run in --copy mode.

If the directory doesn't exist yet the first time we copy/move episodes for a 
given podcast, it will be created.

=back

=head3 OPTIONAL VARIABLES

=over

=item pause

If pause is nonzero, the podcast will be skipped.  This is equivalent to
commenting out every line of the block, except that a log message will be
printed about skipping the podcast.

=item limit 

limit is the maximum number of episodes of a given podcast to download on a
given run.  If set to a fraction like N/D the catcher will only download N
episodes every D days.  (It calculates this by doing a modulus with the number
of days since the epoch; it doesn't keep track of how many days it's been since
you edited the config file to set the limit.)

If this isn't set, the podcatcher will attempt to download all new episodes
of a given podcast every time it's run in download mode.

=item replace

replace must be one or more valid Perl s/// operators (separated by semicolons
and whitespace) which will be applied to podcast filenames before saving them to
disk (e.g., to prefix a consistent string to episodes of a podcast that names
them ep1.mp3, EP_02.mp3, etc).  Regexes can have any modifier except for /e, which
could allow code injection problems.  You must use forward slashes as delimiters.

I mostly use this to prefix a podcast name abbreviation to the filenames
of podcasts which give them unhelpful names like "ep1.mp3" or whatever.  E.g.,
the example above would turn "ep1.mp3" to "BtS__ep1.mp3".  You can also use
it to normalize the filenames in other ways, e.g., expanding "ep" to "episode"
and making sure it's always lowercase, ensuring there's an underscore between
the word "episode" and the number, etc.  E.g.:

C<replace = s/ep(isode)?_*/episode_/i>

will cause ep1.mp3, EPISODE__2.mp3, Ep_3.mp3 etc. to all be normalized
to episode_1.mp3, episode_2.mp3, episode_3.mp3.

If a podcast's episodes sometimes have a prefix but sometimes don't, you can
use negative lookahead to add it only when it's needed:

  replace = s/^(?!WW-)/WW-/

will prefix "WW-" only to episode filenames that don't already have a WW- prefix.

=item reverse 

If this is set to 1, download the episodes at the top of the RSS feed first.
Default behavior is to download the episodes at the bottom first; usually
RSS feeds are ordered from newest to oldest, but now and then you'll find
a perverse feed that is ordered oldest to newest.

=item header

This lets you set individual HTTP headers for a given podcast.  You can 
have multiple instances of this; its value should be a colon-separated string
with the name of an HTTP header on the left and its value on the right.
E.g.:

   header = x-extra-header: stuff that should be ignored

You probably won't need this.  I added it while trying to debug a stubborn
podcast, but wound up fixing the problem by changing the C<LWP::UserAgent>
settings.

=back

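Putting several of the optional variables together, a podcast block might look
like the sketch below.  The podcast name, URL, paths, and the C<EX-> prefix are
placeholders; the two C<replace> operators are separated by a semicolon, and
C<header> appears twice to set two headers.

  name = Example Podcast
  url = https://example.com/feed.xml
  download_dir = /home/user/Downloads/example
  player_dir = /media/user/CLIP/Podcasts/
  keep_dir = /home/user/podcasts/example
  pause = 0
  limit = 2                  # at most two episodes per run
  reverse = 1                # this feed lists its oldest episodes first
  replace = s/ep(isode)?_*/episode_/i ; s/^(?!EX-)/EX-/
  header = Accept: */*
  header = x-extra-header: stuff that should be ignored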

=head1 FILES

The configuration file is described above, as is the log file; either can
have any name the user wishes.  The other files created and used by the 
podcatcher:

=over

=item Downloaded podcast episodes

Saved in the C<download_dir> for each podcast, then copied to the C<player_dir>
and moved to the C<keep_dir>.

=item Episode descriptions

Simple HTML files consisting of the episode title and description taken from 
the RSS feed.  These are saved in the C<download_dir> for each podcast and
have the same filename as the episode, except for replacing .mp3, .oga etc.
with .html.

=item Block lists

A text file named block-list.txt in each C<download_dir>, listing the
episodes we've already downloaded.  You can edit this to delete some lines to
make the podcatcher download them again (preferably not in the middle of a 
run, as that may cause unintended behavior).

=item Bad RSS files

If the podcatcher can't parse an RSS feed, it will save it in the C<download_dir>
to make debugging easier.

=back


=head1 CHANGELOG

=over

=item 2023-09-08

Add -a --agent command line option and 'agent' configuration variable.

Escape regular expression metacharacters in the regular expression for http header names.

Fix invalid pod.

Add function prototypes.

=back


=head1 DEPENDENCIES

This script requires Perl 5.14 or higher.

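C<warnings>, C<strict>, C<constant>, C<Getopt::Long>, C<List::Util>, C<File::Copy>,
C<File::Temp>, C<Data::Dumper>, and C<Pod::Usage> are all in the standard
library, although C<any> and C<uniq> need C<List::Util> 1.45 or later, which
older Perls will have to upgrade from CPAN.  C<LWP::UserAgent> and
C<XML::RSS::LibXML> are on CPAN.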


=head1 AUTHOR 

Jim Henry III, L<http://jimhenry.conlang.org/software/>


=head1 ACKNOWLEDGMENTS

Thanks to jwkrahn, AnomalousMonk, Your Mother, thomas895, and tobyink at
perlmonks.org. 

L<https://www.perlmonks.org/?node_id=11154272>
L<https://www.perlmonks.org/?node_id=11120857>


=head1 LICENSE

This script is free software; you may redistribute it and/or modify it under
the same terms as Perl itself.


=head1 TO DO LIST

Add support for config variables that can expand in later directory variable
settings.  E.g.:

  DL=/home/jim/Downloads
  PLAYER=/media/jim/CLIP/Podcasts

  ....

  name = somepodcast
  download_dir = $DL/somepodcast
  player_dir = $PLAYER/talk

Or maybe just let the user use environment variables in the config file?
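
A minimal sketch of what such expansion could look like, assuming that
variables are written as C<$NAME> and that earlier config settings take
priority over the environment.  The helper name C<expand_vars> is hypothetical
and nothing like it exists in the script yet.

  use strict;
  use warnings;

  # Hypothetical helper: expand $NAME in a config value, first from
  # earlier settings, then from the environment.  Unknown names are
  # left untouched.
  sub expand_vars {
      my ( $value, $vars_r ) = @_;
      $value =~ s{\$(\w+)}{ $vars_r->{$1} // $ENV{$1} // "\$$1" }ge;
      return $value;
  }

  my %vars = ( DL => '/home/jim/Downloads' );
  print expand_vars( '$DL/somepodcast', \%vars ), "\n";
  # prints /home/jim/Downloads/somepodcast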

Would it be advantageous to delegate the downloading and saving to disk
of the actual podcast episodes to `wget`?  It can do much more thorough error
checking and handling than anything I'm likely to be able to implement in
a reasonable amount of time, and save files to disk progressively as they're
downloaded, using up less memory when downloading large files.

If I don't do that, I should probably add more error checking to the episode
downloading code.

-----

Should use MP3::Tag to check whether the files we've downloaded have metadata,
and if not, supply it from the C<name> config variable and the RSS feed
title/description.  For now I'm running a separate cron job to fix the metadata
in files from Acatalepsis, which is the worst offender of those I'm currently
listening to.

------

Check if a response was a redirect and log that.  Ideally, we would update 
the config file with the new RSS URL if that is redirected or "moved permanently"
or something, but that would require a refactoring of the main function where
we first read the configuration file and then process each podcast in two
separate loops (possibly two functions).  In that case we should probably keep
the original url= line but comment it out?

=cut


use v5.14;
use warnings;
use Getopt::Long;
use LWP::UserAgent;
use List::Util qw( any uniq );
use File::Copy;
use File::Temp;		# only used by check_rss_feed_via_wget()
use Data::Dumper;
use XML::RSS::LibXML;
use Pod::Usage;

use constant MAX_ERRS => 3;
use constant SECS_PER_DAY => 86400;
# identify ourselves as Mozilla which lets us see RSS feeds where
# libwww-perl is forbidden 
use constant DEFAULT_AGENT => "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0";

our $version = "0.9";

my %config = (
    debug => 0,                         # set to 2 for more verbose logging
    logfile => '',   # podcatcher-log-%s.txt',
    sleep => 30,
    extensions => 'mp3,m4a,oga',
    copymode => 0,
);

#    extensions => [ qw( mp3 m4a oga ) ],

# m4a files make my mp3 player crash on boot
#my $exts = join "|", qw( mp3 oga );
#my $extension_regex = qr(\.(?:$config{extensions}));
my $extension_regex; 
my $ua;
    

=head1 FUNCTIONS

=head2 usage()

Prints help message.

=cut

sub usage {

    print <<HELP;

podcatcher.pl version $version
http://jimhenry.conlang.org/software/

Usage: $0 [options] podcast-list-file

-d --debug        Turn on debug mode.  Followed by 1 or 2 for moderate or verbose 
                  messages.
-s --sleep        Number of seconds to sleep after downloading each episode.
-D --description  Turn on saving title/description of each episode to an html file.
-l --logfile      Specify where to write the output log.  A %s in the filename will 
                  become the date.
-q --quiet        Write only to the logfile, not to terminal.
-c --copy         Instead of downloading new episodes, copy new files to player and 
                  move them to storage.
-h --help         Print this message.
-m --manual       Print full manual.
-e --extensions   A comma-separated list of file extensions for podcast episodes.
-a --agent        What User-Agent header to use for http requests.

The podcast list file name comes last.  See the full manual for details
on this configuration file.

HELP

    exit;
}


# ===== basic utility routines

sub writelog($);

=head2 test_regex( regex )

Returns true if the arg is one or more valid s/// operators, separated by
semicolons and optional whitespace, with any number of modifiers (but not
/e), and contains nothing else.  This guards against code injection through
the replace field.

Due to technical limitations, only slashes are allowed as delimiters.  I tried
matching with arbitrary delimiters (see the commented-out line) and it gave
false negatives on any regex containing backreferences.  It seems that \\1 in
a negated character class doesn't match the character that the first
parenthesis matched, but literal \ or 1.
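
The way a validated replace value is then split and applied can be sketched as
below.  This is a simplified, standalone illustration: the helper name
C<apply_replace> is hypothetical, and it combines the validation and the
string-eval step that the script performs in separate places.

  use strict;
  use warnings;

  # Hypothetical helper, not part of podcatcher.pl: apply a "replace"
  # value to a filename, or return the name unchanged if the value
  # isn't a list of plain s/// operators.
  sub apply_replace {
      my ( $rules, $name ) = @_;
      my $op = qr{s/ [^/]+ / [^/]* / [msgxiadlun]*}x;
      return $name unless $rules =~ m{\A $op (?: \s* ; \s* $op )* \z}x;
      for my $rule ( split /\s*;\s*/, $rules ) {
          eval "\$name =~ $rule";    # string eval, as in the script
          warn "Problem applying $rule: $@" if $@;
      }
      return $name;
  }

  print apply_replace( 's/ep(isode)?_*/episode_/i ; s/^/BtS__/', 'Ep_3.mp3' ), "\n";
  # prints BtS__episode_3.mp3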

=cut

sub test_regex($) {
    my $regex = shift;
    if ( $regex ) {
        #if ( $regex =~ m!^s(\W) ([^\\1]+) \1 ([^\\1]*)( \1 [msgxiadlun]*$)!x ) {
	if (  $regex =~ m!^(s/ ([^/]+) / ([^/]*) / [msgxiadlun]*)    # at least one of these
                           (\s* ; \s* s/ ([^/]+) / ([^/]*) / [msgxiadlun]*)*  # maybe some of these
                           $!x ) {
            writelog "search: $2 replace: $3\n"  if $config{debug} > 1;
	    return 1;
	} else {
	    writelog "invalid/unsafe regex: $regex\n";
	    return 0;
	}
    } else {
	return 0;
    }

}


=head2 set_extension_regex( extensions )

Take a comma-separated list of extensions, turn it into a pipe-separated
list, and precompile it as a regular expression.

=cut

sub set_extension_regex($) {
    my $exts = shift;
    if ( $exts !~ m/^[a-z0-9]+(,[a-z0-9]+)*$/ ) {
	writelog "$exts doesn't look like a valid set of file extensions.\n";
	exit;
    }
    $exts =~ s/,/|/g;	# /g so that every comma is replaced, not just the first
    $extension_regex = qr(\.(?:$exts))i;
    writelog "set extension list to $exts\n"   if $config{debug};
}



=head2 verify_dir( dirname )

Take a directory name.  Add a trailing slash if it doesn't already have
one.  Test whether it exists and create it if necessary (but not if we
would have to create more than one level of directory; this is likely
to be a typo).  Return the possibly modified directory name, or undef 
on failure to create dir.

E.g., if passed /home/jim/talk/newpodcast and newpodcast doesn't exist
yet, but /home/jim/talk does, it will create the target and return
/home/jim/talk/newpodcast/

=cut

sub verify_dir($) {
    my $dir = shift;
    if ( not $dir ) {
	writelog "Blank directory\n";
	return undef;
    }

    unless ( $dir =~ m!/$! ) {
	$dir .= '/';
    }

    if ( -e $dir ) {
	if ( -d $dir ) {
	    return $dir;
	} else {
	    writelog "$dir exists but is not a directory\n";
	    return undef;
	}
    }

    my $all_but_last = $dir;
    $all_but_last =~ s![^/]+/$!!;
    if ( ! -d $all_but_last ) {
	writelog "$all_but_last doesn't exist yet, so not creating $dir\n";
	return undef;
    }

    writelog "Creating $dir\n";
    if ( mkdir $dir ) {
	return $dir;
    } else {
	writelog "mkdir $dir failed: $!\n";
	return undef;
    }
}

=head2 verify_limit( limit )

Take a podcast limit and check that it's an integer or a fraction like 1/14.
Return the limit if valid, 0 if no limit was given, and undef if it is invalid.
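
The configuration section says a fractional limit is evaluated with a modulus
on the number of days since the epoch.  A minimal sketch of one way that could
work, assuming episodes are allowed only on days whose day number is divisible
by the denominator.  The helper name C<episodes_allowed> and that exact rule
are assumptions for illustration, not the script's code.

  use strict;
  use warnings;
  use constant SECS_PER_DAY => 86400;

  # Hypothetical helper: how many episodes a limit allows at time $now.
  sub episodes_allowed {
      my ( $limit, $now ) = @_;
      return $limit if $limit =~ m!^\d+$!;            # plain integer limit
      my ( $n, $d ) = $limit =~ m!^(\d+)\s*/\s*(\d+)$!
          or return;                                   # not a valid limit
      return unless $d;                                # avoid modulus by zero
      my $day = int( $now / SECS_PER_DAY );            # days since the epoch
      return $day % $d == 0 ? $n : 0;
  }

  print episodes_allowed( '1/14', 14 * SECS_PER_DAY ), "\n";    # prints 1
  print episodes_allowed( '1/14', 15 * SECS_PER_DAY ), "\n";    # prints 0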

=cut

sub verify_limit($) {
    my $limit = shift;
    return 0 if not defined $limit;
    if ( $limit =~ m!^ ( 
                    \d+ 
                    |
                    \d+ \s* / \s* \d+ 
	            ) $ !x ) 
    {
	return $limit;
    } else {
	writelog "Invalid limit: $limit\n";
	return undef;
    }
}

=head2 randsleep()

Sleep a random amount of time, varying from half our configured sleep
time to one and a half times that.  Avoid hammering servers too hard,
especially if we're downloading a lot of episodes at once.

=cut

sub randsleep {
    return unless $config{sleep};
    my $sec = int ((rand() + 0.5) * $config{sleep});
    writelog "Sleeping $sec seconds\n" if $config{debug};
    sleep $sec;
}

=head2 startlog()

Initialize the log file.
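
The date substitution in the log file name can be sketched as below.  This
standalone version uses C<POSIX::strftime> in place of the script's own date
arithmetic, and the helper name C<expand_logfile> is hypothetical.

  use strict;
  use warnings;
  use POSIX qw( strftime );

  # Hypothetical helper: replace %s in a log file name with a
  # YYYY-MM-DD date.  Extra arguments are a localtime-style list;
  # without them the current local time is used.
  sub expand_logfile {
      my ( $template, @when ) = @_;
      @when = localtime unless @when;
      my $date = strftime( '%Y-%m-%d', @when );
      ( my $name = $template ) =~ s/%s/$date/;
      return $name;
  }

  print expand_logfile( '/home/jim/Downloads/podcatcher-log-%s.txt' ), "\n";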

=cut

{ # scope block surrounding startlog and writelog
my $logfh;

sub startlog {
    return if not $config{logfile};

    if ( defined $logfh ) {
	close $logfh;
	undef $logfh;
    }

    if ( $config{logfile} =~ m/%s/ ) {
	my ($day, $month, $year) = (localtime)[ 3, 4, 5 ];
	$month++;
	$year += 1900;
	my $date = sprintf "%04d-%02d-%02d", $year, $month, $day;
	$config{logfile} =~ s/%s/$date/g;	# unlike sprintf, safe if the name has other % signs
    }
    open $logfh, ">>", $config{logfile}	or die "can't open $config{logfile} for appending: $!";
    my $oldfh = select $logfh;
    $| = 1;		# flush log file on every print
    select $oldfh;
    writelog "\n" . ("=" x 75) . "\n\nstarting $0\n";
}

=head2 writelog ( message )

Write a message to the log file, standard output, or both depending on the
logfile and quiet configuration variables.

=cut

sub writelog($) {
    my $arg = shift;
    if ( not $config{quiet} ) {
	print STDOUT $arg;
    }

    if ( $config{logfile} && length $config{logfile} ) {
	if ( not defined $logfh ) {
	    startlog;
	}
	$arg = localtime() . "  " . $arg;
	print $logfh $arg;
    }
}

END {
    if ( defined $logfh ) {
	close $logfh;
    }
}

} # end scope block for log filehandle


# ===== routines for copying new episodes to MP3 player and then moving them to long-term storage

=head2 copy_and_move( podcast hash )

Copy new episodes from the download directories to the MP3 player, then move
them to the keep directories.  Return the number of files successfully copied
and moved.
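
The copy-then-move order can be sketched as below.  This is a simplified
illustration rather than the script's code: the helper name C<copy_then_move>
and the temporary directories are made up for the example.  A file is moved to
storage only after the copy to the player has succeeded.

  use strict;
  use warnings;
  use File::Copy qw( copy move );
  use File::Temp qw( tempdir );

  # Hypothetical helper: copy a file to the player, and only if that
  # works move it to long-term storage.  Returns true on full success.
  sub copy_then_move {
      my ( $path, $player_dir, $keep_dir ) = @_;
      return copy( $path, $player_dir ) && move( $path, $keep_dir );
  }

  # Throwaway directories standing in for download_dir, player_dir
  # and keep_dir.
  my $download = tempdir( CLEANUP => 1 ) . '/';
  my $player   = tempdir( CLEANUP => 1 ) . '/';
  my $keep     = tempdir( CLEANUP => 1 ) . '/';

  open my $fh, '>', $download . 'ep1.mp3'  or die "can't create test file: $!";
  print $fh "not really audio\n";
  close $fh;

  print "copied and moved\n"
      if copy_then_move( $download . 'ep1.mp3', $player, $keep );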

=cut

sub copy_and_move {
    my %podcast = @_;

    state $disk_errors = 0;
    my $dirh;
    unless ( opendir $dirh, $podcast{download_dir} ) {
	writelog "Couldn't read $podcast{download_dir}\n";
	$disk_errors++;
	return 0;
    }
    my $fname;
    my $mp3s_found = 0;
    my $success_fully_copied = 0;
    while ( $fname = readdir $dirh ) {
	my $pathname = $podcast{download_dir} . $fname;
	if ( $pathname =~ m!$extension_regex$! ) {
	    $mp3s_found++;
	    writelog "Fixing to try copy/moving $pathname\n"    if $config{debug};
	    if ( copy $pathname, $podcast{player_dir} ) {
		if ( move $pathname, $podcast{keep_dir} ) {
		    writelog "copied and moved $pathname\n";
		    $success_fully_copied++;
		} else { 
		    writelog "Copied to player okay, but failed to move $pathname to $podcast{keep_dir}: $!\n";
		    $disk_errors++;
		}
	    } else {
    	        writelog "Failed to copy $pathname to $podcast{player_dir}: $!\n";
		if ( $! =~ m/No space left/ ) {
		    writelog "Exiting\n";
		    exit;
		}
		$disk_errors++;
	    }

	    if ( $disk_errors > MAX_ERRS ) {
    		writelog "Exiting after " . MAX_ERRS . " disk errors\n";
		exit;
	    }

	} # end if path is an mp3
    } # end for each file in download directory

    unless ( $mp3s_found ) {
	writelog "No $config{extensions} files found in $podcast{download_dir}\n";
    }
    closedir $dirh;
    
    return $success_fully_copied;
}




# ===== routines for checking RSS feed and downloading new episodes

=head2 save_description_to_file( episode_ref, save_dir, save_name )

Write the title and description of the episode to an html file in the
download directory.

=cut 


sub save_description_to_file {
    my $episode_r = shift;
    my $save_dir = shift;
    my $save_name = shift;
    die if not ref $episode_r eq 'HASH';
    die if not defined $save_name;

    unless ( $episode_r->{title} or $episode_r->{description} ) {
	writelog "No title or description for this episode $save_name\n"; # if $config{debug}
	return;
    }

    my $title = $episode_r->{title} // "";
    my $description = $episode_r->{description} // "";
    
    # change file extension (.mp3, .m4a etc.) to .html
    my $textfile = $save_name;
    $textfile =~ s/\.[^.]+$/.html/;
    if ( $textfile eq $save_name ) {
	$textfile = $save_dir . $textfile . '.html';
    } else {
	$textfile = $save_dir . $textfile;
    }

    # save title and description to text file
    my $descr_fh;
    use open ":utf8";
    if ( open $descr_fh, ">", $textfile ) {
	print $descr_fh qq(<html><head><title>$title</title></head><body>\n\n);
	print $descr_fh qq(<h1>$title</h1>\n\n);
	print $descr_fh qq(<p>$description\n</body></html>);
	
	close $descr_fh;
	writelog "Saved title + description to $textfile\n";
    } else {
	writelog "Couldn't open $textfile for writing\n";
    }
}


=head2 download_episodes ( podcast hash )

Download new episodes of the podcast whose hash is passed as arguments.
Return number of episodes downloaded.

=cut


sub download_episodes {
    my %podcast = @_;

    my @new;
    if ( $podcast{reverse} ) {
	# user has indicated this podcast's episodes are in reverse order in the feed
	# (i.e. oldest ones first)
	@new = @{ $podcast{episodes} };
    } else {
	# reverse order so if we have a limit, we download the oldest ones first
	@new = reverse @{ $podcast{episodes} };
    }

    state $file_err_count = 0;
    
    my $try_count = 0;
    my $succeed_count = 0;

    my $blocklist_fh;
    unless ( open $blocklist_fh, ">>", $podcast{download_dir} . "block-list.txt" ) {
	writelog "Couldn't open $podcast{download_dir}block-list.txt for appending: $!\n";
	# not 100% sure we want to exit rather than return 0.  If we can't write
	# the block list it could be just a problem with this file/directory but
	# it could be something more major that would make it pointless to try
	# downloading other podcasts, either.
	exit(1);
    }

    my $oldfh = select $blocklist_fh;
    $| = 1;		# flush block list file on every print
    select $oldfh;
    
    for my $episode_r ( @new ) {
	my $response;
	writelog "About to download $episode_r->{url} \n";
        if ( ref $podcast{headers} ) {
	    my %headers = %{ $podcast{headers} };
            $response = $ua->get( $episode_r->{url}, %headers );
	} else {
            $response = $ua->get( $episode_r->{url} );
	}
	unless ( $response->is_success ) {
	    my $status = $response->status_line;
	    writelog "failed to download $episode_r->{url}: $status\n";
	    next;
	}

	my $save_name = $episode_r->{url};
	$save_name =~ s!.*/!!;		 # strip the protocol and directory
	#$save_name =~ s!\.mp3.*!.mp3!i;  # strip anything after the file extension like ? & args
	$save_name =~ s!\?.*$!!;	# this works with .m4a, etc. files as well as .mp3

	my $block_name = $save_name;    # save a copy before doing user-specified replacements

	if ( $podcast{replace} ) {
	    # we already tested if it was a valid s/// regex in the main while loop
 	    for my $s ( split /\s*;\s*/, $podcast{replace} ) {
	        #my $eval_string = "\$save_name =~ $podcast{replace}";
		my $eval_string = "\$save_name =~ $s";
		writelog $eval_string . "\n"    if $config{debug};  # > 1
                eval $eval_string;
	        if ( $@ ) {
		     writelog "Problem applying regex $s: $@\n";
	        }
	    }
	}
	
	# should be testing return value of binmode, print, and close too.
	# maybe rewrite this to do a local autodie, then eval{} a block and
	# have a test for errors in place of the else block?

	# or delegate all this to wget
	my $mp3_fh;
	if ( open $mp3_fh, ">", $podcast{download_dir} . $save_name ) {
	    $file_err_count = 0;
	    binmode $mp3_fh;
	    print $mp3_fh $response->content;
	    close $mp3_fh;
	    ++$succeed_count;
	    writelog "Saved $save_name\n";
	    print $blocklist_fh $block_name . "\n";
	    if ( $config{description} ) {
		save_description_to_file( $episode_r, $podcast{download_dir}, $save_name );
	    }
	} else {
	    writelog "Couldn't write $podcast{download_dir}$save_name to disk: $!\n";
	    if ( $! =~ m/No space left/ ) {
	        writelog "Exiting\n";
		exit;
	    }

	    if ( ++$file_err_count > MAX_ERRS ) {
		writelog "Exiting after " . MAX_ERRS . " successive open errors\n";
		exit;
	    }
	    next;
	}

	if ( $podcast{limit} && ++$try_count >= $podcast{limit} ) {
	    writelog "Stopping downloading $podcast{name} episodes after limit of $podcast{limit}\n";
	    last;
	}
    } continue {
	randsleep;
    } # end foreach new episode

    close $blocklist_fh;

    return $succeed_count;
}

=head2 get_new_episodes( podcast hash )

Figure out which episodes are new, i.e. which ones we haven't already
downloaded (by checking the block-list.txt file in the individual podcast
download directory), then pass the modified hash with the new episode list to
download_episodes().
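
The block-list check can be sketched as below.  This standalone version uses a
literal substring test where the script uses a regular expression match, and
the helper name C<filter_new> and the example URLs are made up.

  use strict;
  use warnings;

  # Hypothetical helper: return the episodes whose URL contains none of
  # the names in the block list.
  sub filter_new {
      my ( $episodes_r, $block_list_r ) = @_;
      my @names = grep { length } @$block_list_r;    # ignore blank lines
      my @new;
      for my $ep ( @$episodes_r ) {
          my $url = $ep->{url};
          push @new, $ep unless grep { index( $url, $_ ) >= 0 } @names;
      }
      return @new;
  }

  my @episodes = (
      { url => 'https://example.com/audio/ep1.mp3?source=feed' },
      { url => 'https://example.com/audio/ep2.mp3' },
  );
  my @new = filter_new( \@episodes, [ 'ep1.mp3' ] );
  print "$_->{url}\n" for @new;    # prints only the ep2.mp3 URL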

=cut 

sub get_new_episodes {
    my %podcast = @_;
    my $dir = $podcast{download_dir};
    my @episodes = @{ $podcast{episodes} };

    my @block_list;
    my $blocklist_fh;
    if ( open $blocklist_fh, "<", $dir . "block-list.txt" ) {
	local $/ = "\n";  # read this file line by line; restored when the block ends
	while ( <$blocklist_fh> ) {
	    chomp;
	    push @block_list, $_;
	}
	close $blocklist_fh;
	writelog "Got " . (scalar @block_list) . " items from block-list.txt\n";
	writelog join ("\n", @block_list) . "\n"    if $config{debug} > 1;
    } else {
	# this is not necessarily bad, it could mean we've got a new podcast
	# with no block list yet
	writelog "Can't open " . $dir . "block-list.txt.  New podcast?\n";
    }

    if ( $config{debug} > 1 ) {
	writelog "Total episodes of $podcast{name} before applying block list:\n";
	foreach ( @episodes ) {
	    writelog $_->{url} . "\n";
	}
    }

    my @new;
    for my $ep ( @episodes ) {
	#unless ( grep { $episode =~ m/$_/ } @block_list ) {
	unless ( any { $ep->{url} =~ m/\Q$_\E/ } @block_list ) {
	    push @new, $ep;
	}
    }

    if ( $config{debug} ) {
	writelog "Found " . scalar(@new) . " links not matched in block-list.txt\n" .
		( ($config{debug} > 1 && @new) ? join("\n", map { $_->{url} } @new) . "\n" : "");
    }

    $podcast{episodes} = \@new;
    return download_episodes( %podcast );
}

=head2 debug_headers( request )

Use Data::Dumper to write the HTTP headers from a request object
to the log/standard output.

=cut

sub debug_headers {
    my($request) = @_;
    return unless $config{debug} > 1;

    # according to:
    # https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept
    # this header should tell the site we'll accept whatever they send.
    #Accept: */*

    my %headers = %{ $request->headers };
    writelog "Headers for request:\n" . Dumper( \%headers );
}


=head2 get_mp3_links_from_string( RSS file as string )

Attempt to parse the RSS file using XML::RSS::LibXML, then check for podcast
files in the <enclosure> tags; if none are found, check the <media:content> tags;
and if none are found there, do a regular expression match for mp3 filenames.
Save the title and description (if found in the enclosure or media:content tags)
and return a reference to an array of episode hashes.

=head3 Discussion

Some RSS feeds mention each MP3 URL twice, in a <media:content> tag and
in an <enclosure> tag.  I started using uniq to get rid of the
duplicates.  I don't think we need to sort; sorting by pathname
would probably get the episodes out of order (which might matter if we're
downloading and listening to it gradually, one episode every few days)
and sorting by filename exclusive of path probably would, for any podcast
with inconsistent filenames (which seems to be the majority).

However, further testing with other podcasts indicates that <media:content>
and <enclosure> might be different URLs for different places the same file
is stored.  Maybe one of them is a redirect to the other?  Maybe we want
to go back to checking for <enclosure> tags instead of all mp3 URLs
regardless of what tag they occur in. That failed for some oddly-formatted
podcasts, but we could have a fallback to the more greedy method of looking
for URLs if the conservative method of looking at <enclosure> tags doesn't
find any.
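
The greedy regex fallback can be sketched as below.  This is a standalone
illustration: the helper name C<links_by_regex> and the sample page are made
up, and duplicates are dropped with a hash rather than with C<uniq> so the
sketch runs on older versions of C<List::Util>.

  use strict;
  use warnings;

  # Hypothetical helper: pull every URL ending in a wanted extension out
  # of a page, dropping duplicates but keeping the order of the feed.
  sub links_by_regex {
      my ( $page, @exts ) = @_;
      my $alt = join '|', map { quotemeta($_) } @exts;
      my %seen;
      return grep { !$seen{$_}++ } $page =~ m/(http[^\s>]+\.(?:$alt))/gi;
  }

  my $page = <<'RSS';
  <enclosure url="http://example.com/a/ep1.mp3" />
  <media:content url="http://example.com/a/ep1.mp3" />
  <a href="http://example.com/b/ep2.OGA">episode 2</a>
  RSS

  print "$_\n" for links_by_regex( $page, qw( mp3 oga ) );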

=cut


sub get_mp3_links_from_string {
    my $pagecontent = shift;
    my @episodes;
    my $parser = XML::RSS::LibXML->new;

    # for some bizarre reason, putting curly brackets around this eval generates
    # syntax errors.  use q// instead.
    eval q/ $parser->parse($pagecontent) /;
    
    if ( $@ ) {
	writelog "Could not parse page as XML/RSS: $@\n";
	$parser = undef;
    }

    if ( $parser ) {
    
	foreach my $item (@{ $parser->{items} }) {
	    my $ep;
	    if ( defined $item->{enclosure} ) {
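		if ( $ep = $item->{enclosure}{url} and $ep =~ m!$extension_regex$! ) {
		    push @episodes, { url => $ep };
		} elsif ( $ep = $item->{media}{content}{url} and $ep =~ m!$extension_regex$! ) {
		    push @episodes, { url => $ep };
		} else {
		    # no URL with a wanted extension, so nothing was pushed and
		    # there is no new episode to attach a title/description to
		    next;
		}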
	    } else {
		next;
	    }

	    if ( $config{description} ) {
		$episodes[ $#episodes ]->{title} = $item->{title};
		$episodes[ $#episodes ]->{description} = $item->{description};
	    }
	} # end for each <item>
    } # end if we have a valid parse

    unless ( @episodes ) {    
	writelog "Found no $config{extensions} files by parsing XML, checking via regex for any $config{extensions} links in any context\n";
        my @mp3s = uniq ( $pagecontent =~  m/(http[^\s>]+$extension_regex)/gi );
	return undef unless ( @mp3s );
	foreach ( @mp3s ) {
	    push @episodes, { url => $_ };
	}
    }
    
    return \@episodes;
}


=head2 check_rss_feed_via_wget( podcast hash )

Download the RSS feed using a system call to C<wget>, then check it for new
episodes.  Returns the number of episodes downloaded.

For now this is a fallback for when C<LWP::UserAgent> gets an error trying to
get an RSS feed.  At some point I might make this the primary way to get RSS
feeds and/or podcast files.
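
The exit status handling after the C<system> call follows the usual Perl idiom
for decoding C<$?>.  A minimal standalone sketch of that idiom;
C<describe_wait_status> is a hypothetical helper, not a function in this script.

    use strict;
    use warnings;

    # Turn a wait status as found in $? into a short description.
    sub describe_wait_status {
        my ($status) = @_;
        return 'failed to execute'  if $status == -1;
        if ( $status & 127 ) {
            return sprintf 'died with signal %d, %s coredump',
                $status & 127, ( $status & 128 ) ? 'with' : 'without';
        }
        return sprintf 'exited with value %d', $status >> 8;
    }

    # run a child perl that exits with value 3, then describe the result
    system( $^X, '-e', 'exit 3' );
    print 'child ', describe_wait_status( $? ), "\n";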

=cut

sub check_rss_feed_via_wget {
    my %podcast = @_;

    writelog "Going to try getting $podcast{name} RSS feed via wget\n";
    my $rssfile = File::Temp::tempnam( $podcast{download_dir}, 'rss_' );
    my @cmd = ( 'wget', $podcast{url}, "--output-document=$rssfile" );
    my $rc = system(@cmd);
    unless ( $rc == 0 ) {
	if ($? == -1) {
	    writelog "wget failed to execute: $!\n";
	} elsif ($? & 127) {
	    writelog sprintf "wget died with signal %d, %s coredump\n",
		($? & 127),  ($? & 128) ? 'with' : 'without';
	} else {
	    my $true_rc = $? >> 8;
	    writelog sprintf "wget exited with value %d\n", $true_rc;
	}	
	return 0;
    }

    my $rss_file_content;
    { # nameless block to localize record separator for slurping
	local $/;
	my $rss_fh;
	unless ( open $rss_fh, '<', $rssfile ) {
	    writelog "Could not open $rssfile: $!\n";
	    return 0;
	}
	$rss_file_content = <$rss_fh>;
	close $rss_fh;
	unlink $rssfile unless $config{debug};
    }

    # check for mp3 file links
    # should factor out the logic for checking for different kinds of
    # links to a different function and call it both here and in check_rss_feed()

    my $episodes_r = get_mp3_links_from_string( $rss_file_content );
    if ( $config{debug} ) {
        writelog "Found " . ( ref $episodes_r ? scalar( @$episodes_r ) : 0 ) . " MP3 file links in page\n";
    }
    
    if ( ref $episodes_r && @$episodes_r ) {
        $podcast{episodes} = $episodes_r;
        return get_new_episodes( %podcast );
    } 

    return 0;
}


=head2 check_rss_feed( podcast hash )

Download the RSS feed using C<LWP::UserAgent::get>, then call
get_mp3_links_from_string() to check it for new episodes and download them.
Returns the number of episodes downloaded.
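
If the fetch fails with a client error, the feed is retried through
check_rss_feed_via_wget().  A minimal sketch of that decision, assuming the
intent is "any 4xx status except 404"; C<should_retry_with_wget> is a
hypothetical helper, not a function in this script.

    use strict;
    use warnings;

    # True for HTTP client errors (400-499) other than 404 Not Found.
    sub should_retry_with_wget {
        my ($code) = @_;
        return ( $code >= 400 && $code < 500 && $code != 404 ) ? 1 : 0;
    }

    for my $code ( 200, 404, 406, 500 ) {
        printf "%d: %s\n", $code,
            should_retry_with_wget($code) ? 'retry with wget' : 'no retry';
    }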

=cut

sub check_rss_feed {
    my %podcast = @_;

    unless ( $podcast{download_dir} =~ m!/$! ) {
	$podcast{download_dir} .= '/';
    }

    writelog "Downloading RSS feed for $podcast{name}\n";

    my $response;
    if ( ref $podcast{headers} ) {
        my %headers = %{ $podcast{headers} };
        $response = $ua->get( $podcast{url}, %headers );
    } else {
        $response = $ua->get( $podcast{url} );
    }
    
    if ( not $response->is_success ) {
	writelog "Couldn't get RSS feed for $podcast{name}: " . $response->status_line . "\n";
	# if we've got any 4xx error other than 404, double-check the RSS using wget.
	# the specific one I've had trouble with is 406 (Not Acceptable) on Make Ours Marvel.
	# test the numeric status code; the status line also contains the message text.
	my $code = $response->code;
	if ( $code >= 400 && $code < 500 && $code != 404 ) {
	    return check_rss_feed_via_wget( %podcast );
	} else {
	    return 0;
        } 
    }

    # the header may be missing entirely, so default to an empty string
    my $content_type = $response->header("content-type") // '';
    writelog "Content-type: $content_type\n";

    unless ( $content_type =~ m/xml|rss/i ) {
	writelog "Page content is not RSS or XML\n";
	my ($day, $month, $year) = (localtime)[ 3, 4, 5 ];
	$month++;
	$year += 1900;

	my $bad_rss_file = sprintf "%sindex-%04d-%02d-%02d.xml",
	    $podcast{download_dir}, $year, $month, $day;

	my $badrss_fh;
	if ( open $badrss_fh, ">", $bad_rss_file ) {
	    writelog "writing to $bad_rss_file\n";
	    print $badrss_fh $response->content;
   	    close $badrss_fh;
	} else {
	    writelog "open $bad_rss_file failed\n";
  	}
    }
    
    my $episodes_r = get_mp3_links_from_string( $response->decoded_content );

    if ( $config{debug} ) {
        writelog "Found " . ( ref $episodes_r ? scalar( @$episodes_r ) : 0 ) . " $config{extensions} file links in page\n";
    }
    if ( ref $episodes_r && @$episodes_r ) {
        $podcast{episodes} = $episodes_r;
        return get_new_episodes( %podcast );
    } 

    return 0;
}


=head2 main

Parse command line options, read the config file and parse it, verify the
validity of the config variable values, and for each valid podcast block, call
check_rss_feed() or copy_and_move(), then report number of episodes
downloaded/copied.

=head3 Discussion

This should probably be refactored.  The main function is over two hundred
lines of code.

Possible ways to refactor, and advantages thereof:

1. Parse the config file in a while loop and save the podcast hashes
in an array, then download or copy the podcasts in a for loop (which
could be in new functions called by an C<if ( $copymode ) ... else> structure).
Both the parsing and the download/copy loop could be separated out into
functions called from main.

2. Iterate over the config file with a line-by-line while loop (not
para-by-para).  This would let us print line numbers for each parsing
error, and line number ranges for problems like missing variables.
We could also save the podcast line number range in each %podcast hash
and let the download/copy functions report those line numbers when
it would help the user debug an error (like a bad directory or regex).
Of course, we currently print the podcast name whenever possible, so that
might not be much more help.

3. Both
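
A rough sketch of options 1 and 2 combined, for illustration only: read the
configuration line by line, collect each block into a hash that remembers its
line range, and return the list for a later download or copy loop.
C<parse_config_lines> and the C<first_line>/C<last_line> keys are hypothetical,
and validation of variable names is omitted.

    use strict;
    use warnings;

    sub parse_config_lines {
        my @lines = @_;
        my ( @podcasts, %current );
        my $lineno = 0;

        # the extra empty string flushes the final block
        for my $line ( @lines, '' ) {
            $lineno++;

            if ( $line =~ m/^\s*$/ ) {
                # a blank line ends the current block, if there is one
                if ( %current ) {
                    push @podcasts, +{ %current };
                    %current = ();
                }
                next;
            }

            ( my $text = $line ) =~ s/\s*#.*//;     # drop comments

            if ( $text =~ m/^\s*(\w+)[ \t]*=[ \t]*(.*?)\s*$/ ) {
                $current{first_line} //= $lineno;
                $current{last_line} = $lineno;
                $current{$1} = $2;
            }
        }
        return @podcasts;
    }

    my @podcasts = parse_config_lines(
        'name = Example Show',
        'url = http://example.com/feed.xml   # hypothetical feed',
        '',
        'name = Another Show',
        'download_dir = /tmp/another',
    );
    printf "%s: lines %d-%d\n", @{$_}{qw( name first_line last_line )}
        for @podcasts;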

=cut


#RFC7230
# tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / 
#      "." / "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
my @http_header_valid_punctuation = ( '!' , '#' , '$' , '%' , '&' , "'" , '*' , '+' , '-' , 
				      '.' , '^' , '_' , '`' , '|' , '~' );
my $header_punct_backslashed = quotemeta( join '', @http_header_valid_punctuation );
my $http_header_regex = qr(^ ( [ A-Z a-z 0-9 $header_punct_backslashed]+ ) \s* : \s* (.*))xx;

Getopt::Long::Configure('bundling');

GetOptions( \%config,
	    'sleep|s=i',
	    'debug|d=i',
	    'description|D',
	    'logfile|l=s',
	    'quiet|q',
	    'copy|c',
	    'help|h',
	    'manual|m',
	    'extensions|e=s',
	    'agent|a=s',
) or usage; # die "Bad command line arguments\n";

usage if $config{help};
pod2usage(-verbose => 2) && exit  if $config{manual};

startlog;

set_extension_regex( $config{extensions} );

if ( $config{debug} ) {
    for ( keys %config ) {
	writelog "Config key $_ = $config{$_}\n";
    }
}

$ua = LWP::UserAgent->new;
if ( not ref $ua ) {
    # write log and exit
    writelog "unable to create UserAgent object\n";
    exit(1);
}

# show a progress bar on all downloads if running in terminal
$ua->show_progress( 1 );

if ( $config{agent} ) {
    writelog "setting user-agent to $config{agent}\n" if $config{debug};
    eval {  $ua->agent( $config{agent} ); };
    if ( $@ ) {
	writelog "Setting user-agent to $config{agent} failed ($@)\n";
	$ua->agent( DEFAULT_AGENT );
    }
} else {
    writelog "setting user-agent to " . DEFAULT_AGENT . "\n" if $config{debug};
    $ua->agent( DEFAULT_AGENT );
}
    
    
#$ua->agent('Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0');
$ua->add_handler( request_prepare => \&debug_headers );
 
my @required_variables = qw( name download_dir );
my @permitted_variables = ( @required_variables, qw( player_dir keep_dir url pause limit replace header reverse ) );
my @config_variables = qw( sleep description logfile quiet debug extensions agent );

if ( $config{copy} ) {
    writelog "Running in copy/move mode\n";
    push @required_variables, qw( player_dir keep_dir );
} else {
    writelog "Running in check RSS/download mode\n";
    push @required_variables, qw( url );
}

# for now, get config file from command line after all switch args have been parsed.
# later might make it a command line option if we need to read more than one file for
# different purposes.

my $episodes_downloaded = 0;
my $episodes_copied = 0;
my @podcasts_with_new_episodes;

$/ = ''; 	# paragraph mode
POD: while ( <> ) {
    my $paragraph = $_;
    my %podcast;
    my $found_config_vars = 0;
    writelog "starting new config item\n"  if $config{debug};

    next if m/^\s*#\s*TEMPLATE/;    
    
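    # count the substitutions first; && binds tighter than =, so combining the
    # two in one expression would store a boolean instead of the count
    my $comments = s!\#(.*)$!!gm;
    if ( $comments  &&  $config{debug} > 1 ) {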
        writelog "Removed $comments comments\n" 
	    . "Paragraph after removing comments:\n"
	    . $_;
    }

    # strip trailing spaces from each line. can't use \s because we don't want to
    # remove the newlines.
    s/[ \t]+$//gm;

    if ( m/^\s*$/s ) {  # paragraph is empty after removing comments
	writelog "empty paragraph\n" if $config{debug} > 1;
	next;
    }

    my $vars_count = 0;
    while ( m/ ^ \s* (\w+) [ \t]* = [ \t]* ([^\n]*) /gmx ) {
	my ($variable, $value) = ( $1, $2 );
	$vars_count++;
	if ( any { $variable eq $_ } ( @permitted_variables ) ) {
	    if ( $variable eq 'header' ) {
		if ( $value =~ $http_header_regex ) {
		    my ( $header, $hvalue ) = ( $1, $2 );
		    $podcast{headers}->{ $header } = $hvalue;
		    writelog "$header = $hvalue\n" if $config{debug} > 1;
		} else {
		    writelog "Bad format for header: $value\n";
		}
	    } else {
	        $podcast{ $variable } = $value;
	        writelog "$variable = $value\n" if $config{debug} > 1;
	    }
	} elsif ( any { $variable eq $_ } @config_variables ) {
	    # this has to be handled before we assign $value to the config hash entry
	    # otherwise we can't write the final log message if it's being set to null
	    if ( $variable eq 'logfile' ) {
		if ( $value ) {
		    startlog;
		} elsif ( $config{logfile} ) {
		    writelog "Ending logging as logfile is set to null\n";
		}
	    }
	    $found_config_vars = 1;
	    $config{ $variable } = $value;
	    writelog "$variable = $value\n" if $config{debug} > 1;

            if ( $variable eq 'extensions' ) {
		set_extension_regex( $value );
	    } elsif ( $variable eq 'agent' ) {
                writelog "Setting user-agent to $config{agent}\n";
		eval {  $ua->agent( $config{agent} ); };
		if ( $@ ) {
		    writelog "Setting user-agent to $config{agent} failed ($@)\n";
		    $ua->agent( DEFAULT_AGENT );
		}
	    }
	} else {
	    writelog qq(Illegal variable "$variable"\n);
	}
    }

    if ( not $vars_count ) {
	writelog "No valid variable = value pairs found in:\n$paragraph\n";
	next POD;
    }
    
    if ( $found_config_vars ) {
        next POD;
    }

    $found_config_vars = 0;

    if ( exists $podcast{replace} ) {
	unless ( test_regex( $podcast{replace} ) ) {
	    delete $podcast{replace};
	}
    }

    ###TODO test this with valid and invalid limits    
    if ( exists $podcast{limit} ) {
	unless ( defined ($podcast{limit} = verify_limit( $podcast{limit} ) ) ) {
	    next;
	}
    }

    my $missing_fields = 0;
    foreach ( @required_variables ) {
	unless ( exists $podcast{$_} ) {
	    writelog "Missing $_ variable in config paragraph\n";
	    $missing_fields++;
	}
    }
    next if $missing_fields;

    ###TODO: should this also test if we're in copy mode?
    # Or should we have separate pause variables for pausing downloading and
    # pausing copying?
    if ( $podcast{pause} ) {
	writelog "$podcast{name} is on pause so we're skipping it\n";
	next;
    } else {
	writelog "Working on $podcast{name}\n";
    }
    
    # don't check the player_dir if we're not in copy mode.  The MP3 player
    # probably isn't mounted and we would skip this podcast (next POD;) if we couldn't
    # create a directory on it.
    my @dirs_to_check = ( 'download_dir' );
    push @dirs_to_check, qw( player_dir keep_dir )  if $config{copy};
    foreach ( @dirs_to_check ) {
	if ( exists $podcast{$_} ) {
	    my $good_dir = verify_dir( $podcast{$_} );
	    if ( $good_dir ) {
		$podcast{$_} = $good_dir;
	    } else {		
		writelog "Bad directory $podcast{$_}\n";
		next POD;
	    }
	}
    }
    
    # if the limit is e.g. 1/3, 2/7 etc, download N episodes every D days
    
    # rather than check if we're in copy mode here, maybe move this test to the
    # beginning of check_rss_feed() or inside the if ( $config{copy} ) else block
    # later in main?

    if ( !$config{copy} && $podcast{limit} && $podcast{limit} =~ m!(\d+)/(\d+)! ) {
	my ($numerator, $denominator) = ($1, $2);
	my $days_since_epoch = int( time / SECS_PER_DAY);
	if ( $days_since_epoch % $denominator == 0 ) {
	    $podcast{limit} = $numerator;
	} else {
	    writelog "day $days_since_epoch is not divisible by $denominator so won't download any episodes of $podcast{name}\n";
	    next;
	}
    }
    my $episodes_of_this_podcast = 0;
    if ( %podcast ) {
	if ( $config{copy} ) {
	    $episodes_of_this_podcast = copy_and_move( %podcast );
	    $episodes_copied += $episodes_of_this_podcast;
	} else {
	    $episodes_of_this_podcast = check_rss_feed( %podcast );
	    $episodes_downloaded += $episodes_of_this_podcast;
	}
	if ( $episodes_of_this_podcast ) {
	    push @podcasts_with_new_episodes, $podcast{name};
	}
    } else {
	# this should probably not be reached due to the @required_variables check above
	# but just to be safe
	writelog "Bad configuration format:\n$paragraph";
    }
}


if ( $config{copy} ) {
    writelog "Copied/moved $episodes_copied total episode files\n";
} else {
    writelog sprintf "Downloaded %d total episodes%s\n", $episodes_downloaded,
        $episodes_downloaded > 0
	    ? " of " . (join ', ', @podcasts_with_new_episodes)
	    : "";
}


