Mike, bad headlines - see below. I don't know if they got through (skip the bad header to get to the content). It was nice talking to you last night.
I think around 20-100 lines of easy perl code could:
1. create a directory listing of the html files exported by a feed reader (or a search reader yet to be created, YTBC).
2. run tidy on each file. Pandoc needs good html, so clean up with tidy first.
3. run pandoc on each file, outputting markdown and sending all the results to one file (a rough sketch of steps 1-3 appears right after step 4).
4. parse the "big" file using perl. The construct /\[(mm)\]\((mm)\)/ matches a markdown url, where mm stands for a regular expression, and the variables $1 and $2 then contain the link name and url respectively, so processing markdown with perl should be easy.
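The first pass (steps 1-3) might look something like the sketch below. I have not run it either; the directory ./export, the temp file tidy.html, and the output file grand.md are just example names, and it assumes tidy and pandoc are installed.

#!/usr/bin/perl
# first pass: steps 1-3 (untested sketch; file and directory names are examples)
use strict;
use warnings;

my @files = glob("export/*.html");   # 1. directory listing of the exported html files
unlink "grand.md";                   # start the "big" markdown file fresh

foreach my $f (@files) {
    # 2. clean up with tidy first, since pandoc needs good html
    #    (tidy exits nonzero even on warnings, so don't die on its return value)
    system("tidy", "-q", "-asxhtml", "-o", "tidy.html", $f);

    # 3. convert to markdown and append all the results to one file
    system("pandoc -f html -t markdown tidy.html >> grand.md");
}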
First, process the url's. The second-pass code below goes through the "grand" file looking for url's:
# second pass: pull url's out of the "grand" markdown file
# (the file names here are only examples)
open(MD,  '<', 'grand.md')   or die "grand.md: $!";    # the big file from pandoc
open(FU2, '>', 'links.html') or die "links.html: $!";  # html list of links and titles
open(FU3, '>', 'fetch.sh')   or die "fetch.sh: $!";    # wget commands to run later
my $parms = "";                            # extra wget options, if any
while (<MD>)
{
    chomp;
    if ( /\[([^\]]*)\]\(([^)]*)\)/ )       # markdown is easy to recognize,
    {                                      # the parentheses put () content into
                                           # the variables $1 and $2
        print "Url is $2 and title is $1\n";      # info to stdout
        print FU2 "<br><a href='$2'>$1</a>\n";    # or something like this, to print
                                                  # an html file of links and titles
        print FU3 "wget $parms $2\n";             # print wget file, which you can execute
    }
}
close(MD); close(FU2); close(FU3);
So each url ends up on its own line, and the above line-oriented script extracts any url it finds. This is much easier than parsing html. I have not tested this perl code; it was only written in the email tool.
5. Save the list of url's and run wget on each url to get the big files (a short sketch follows step 6).
6. Then repeat the algorithm on the list of wget'ed files.
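Steps 5 and 6 might then be no more than this (again untested; fetch.sh is just the example name used for the wget file above):

# 5. run the wget commands collected by the second pass
system("sh fetch.sh") == 0 or warn "some downloads failed: $?";

# 6. then point the same tidy/pandoc/url-extraction passes at the wget'ed
#    files and go around again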
I have not done this yet, but this is the next plan. It's the next evolutionary step. Funny, I was already working on this before you called, so I am glad you called. If I succeed in doing this I will send you the perl script.
== KISS means keep it simple
Iltis
g.