Create an external “links” section for WordPress.com posts

Preamble

Following on from my partial success with the table of contents (ToC) creator (see Create table of contents for WordPress.com posts; the automatic HTML formatting of WordPress is what prevents total success), I thought that it would be useful to have a script that pulls out all of the (external) links referenced in a post and places them under the heading of “Links”…

Process

The process would be more or less the same as the table of contents script, but this time instead of searching, using regex, for <Hx> and </Hx>, the search would be for <A HREF=""> and </A>, and copying both the anchor tags and what is between the start and end tags into a Links.html file. Then this file could be merged into the main body, in the same way as the ToC is merged. Obviously, generate, create, remove, insert options will all apply, as they did with the ToC script.

Obviously the script would be written in Perl.

Initial Thoughts

Now one issue is that the ToC script has become rather complex, even though I have tried to modularise it. However, re-visiting it for the purpose of extracting links will give reason to refactor it for the better.

Now there may be a conflict with the links added by the ToC and this is where the uniqueHeadingIndentifer string would come into play. All links with that would be excluded. However, these links are only in the ToC itself, so the ToC section, if one exists, could be skipped. In fact it is probably better to run the ToC script after the Link script, so that the Links section is added to the ToC.

The links would need to be added in an unordered HTML list, <ul>, and here problems may arise. As I documented in Getting a nest list correctly – in WP Classic Editor, WordPress has a nasty habit of reformatting the list, and not for the better. However, as there should be no need for nesting (initially) then there may be no issue.

However, later, it may be desirable to group the links, i.e. StackExchange links, Arduino forum, Processing examples, etc. Here, nesting would be required, and so WordPress may cause ugly reformatting.

Modifying the ToC script

The Links functionality was added to v0.2.9 of the ToC.pl script.

Regex

So to find a link and to print or save it to a file… My first stab was:

my $link = "";
if ($line =~ /(<a\shref=".*<\/a>)/i) {
    $link = $1;
    # Print only link contents
    print $link."\n" if $debug > 2 || $verbose > 0;  # Debug/verbose
    print OUTPUTFILE_LINKS $link."\n";
}

But this is too greedy: if there is more than one link in a line, it grabs everything between the first <a href and the last </a>. So use ? to make the match non-greedy:

my $link = "";
if ($line =~ /(<a\shref=".*?<\/a>)/i) {
    $link = $1;
    # Print only link contents
    print $link."\n" if $debug > 2 || $verbose > 0;  # Debug/verbose
    print OUTPUTFILE_LINKS $link."\n";
}

Now it is not so greedy, but it only picks up the first link in a line that may contain multiple links. Changing the if to a foreach, and adding the g modifier:

my $link = "";
foreach ($line =~ /(<a\shref=".*?<\/a>)/ig) {
    $link = $1;
    # Print only link contents
    print $link."\n" if $debug > 2 || $verbose > 0;  # Debug/verbose
    print OUTPUTFILE_LINKS $link."\n";
}

This is better, but if a line contains two links then the first link is not printed and instead the second link is printed twice. The reason is that, in list context, the /g match finds every link before the loop body runs, so by then $1 holds only the last capture. A line with three links would therefore print the third link three times (whilst the first two were skipped).
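The behaviour of $1 can be checked in isolation. The following is a minimal sketch, not part of ToC.pl: the sample line and the two array names are made up for the demonstration, and the regex is the one from the third attempt above.

```perl
use strict;
use warnings;

# Hypothetical sample line containing three links.
my $line = '<a href="https://a.example">A</a> and '
         . '<a href="https://b.example">B</a> and '
         . '<a href="https://c.example">C</a>';

# In list context, m//g performs every match before the loop body starts,
# so $1 is left holding the capture from the final successful match.
my @seen_via_dollar1;
foreach ($line =~ /(<a\shref=".*?<\/a>)/ig) {
    push @seen_via_dollar1, $1;
}

# The loop variable ($_ here) is set to each captured link in turn.
my @seen_via_loop_var;
foreach ($line =~ /(<a\shref=".*?<\/a>)/ig) {
    push @seen_via_loop_var, $_;
}

print "via \$1: $_\n" for @seen_via_dollar1;
print "via \$_: $_\n" for @seen_via_loop_var;
```

The first group of output lines should repeat the last link once per match, while the second group should list each link once, in order.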

Removing the g doesn’t help and produces the same output as the if (second example).

Adding m at the start didn’t help either.

The solution was to not use $1 but to add a local variable, my $link, to the foreach:

foreach my $link ($line =~ m/(<a\shref=".*?<\/a>)/ig){
    # Print only link contents
    print $link."\n" if $debug > 2 || $verbose > 0;  # Debug/verbose
    print OUTPUTFILE_LINKS $link."\n";
}

These solutions were from Perl, match one pattern multiple times in the same line delimited by unknown characters.

In addition, any attribute, e.g. class and/or style-scope, between the a and the href causes the link not to match. So, for these cases, a non-greedy catch-all is added in place of the \s:

foreach my $link ($line =~ m/(<a.*?href=".*?<\/a>)/ig){
    # Print only link contents
    print $link."\n" if $debug > 2 || $verbose > 0;  # Debug/verbose
    print OUTPUTFILE_LINKS $link."\n";
}

This seemed to fix it! Now all the links are printed.
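One caveat with <a.*? is that it can also start matching at other tags whose names begin with "a", such as <abbr>, and then run on to the next href. A slightly tighter pattern is sketched below. It is only a sketch under assumptions: extract_links() is a hypothetical helper (not a subroutine from ToC.pl), and it assumes double-quoted href values and links that do not span lines.

```perl
use strict;
use warnings;

# Return every <a ... href="...">...</a> element found in one line of HTML.
sub extract_links {
    my ($line) = @_;
    # \b after "a" rejects <abbr>, <article>, etc.
    # [^>]* keeps the attribute scan inside the opening tag.
    my @links = $line =~ m{(<a\b[^>]*\bhref="[^"]*"[^>]*>.*?</a>)}ig;
    return @links;
}

# Hypothetical sample line: an <abbr>, a link with a class, and a plain link.
my $sample = '<abbr title="x">HTML</abbr> '
           . '<a class="ext" href="https://example.com/1">one</a>, '
           . '<a href="https://example.com/2">two</a>';

print "$_\n" for extract_links($sample);
```

For anything more complicated than this, a proper HTML parser would be the safer choice, but for well-behaved WordPress output a regex of this kind may be enough.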

Remove duplicate links

Now to remove duplicate links. The first attempt is from this post, Removing repeated lines from file:

open(INFH, "<filename.txt");
my @data=<INFH>;
close(INFH);
open(OUTFH, ">outfile.txt");
my %hashdata = ();
foreach my $thisline (@data)
{ $hashdata{$thisline} = 1; }
foreach my $thisline (sort(keys(%hashdata)))
{ print OUTFH $thisline; }  # no comma after the filehandle
close(OUTFH);

But this re-orders the links: the sort puts them in alphabetical order, and removing the sort doesn’t help, because a hash returns its keys in no particular order.

Alternatively, this post from Removing duplicate lines from files (was ‘Files’) performs the task in memory, which would avoid the use of an additional file:

open FILE, $filename or die "Can't open $filename\n";
my @lines = <FILE>;
close FILE;
my @new_list = keys map {$_ => 1} @lines;
...
open (FILE, ">$filename") or die "Can't open $filename\n"; 
foreach (@new_list) {
    print FILE $_;  # lines read from <FILE> already end in "\n"
}
close (FILE); 

Unfortunately, this gave an error:

Type of argument to keys on reference must be unblessed hashref or arrayref at ./ToC_v0.2.9.pl line 1004.

The fix is to add {}, see Perl: Type of argument to keys on reference must be unblessed hashref or arrayref or Type of argument to keys on reference must be unblessed hashref or arrayref

open FILE, $filename or die "Can't open $filename\n";
my @lines = <FILE>;
close FILE;
my @new_list = keys {map {$_ => 1} @lines};
...
open (FILE, ">$filename") or die "Can't open $filename\n"; 
foreach (@new_list) {
    print FILE $_;  # lines read from <FILE> already end in "\n"
}
close (FILE); 

However, the crazy re-ordering happens again, since keys still returns the lines in hash order rather than in file order.
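An in-memory approach that keeps the original order is to filter the array with grep and a "seen" hash, rather than asking a hash for its keys. A minimal sketch follows, with unique_in_order() as a hypothetical helper name and made-up sample links. As an aside, I believe that calling keys directly on a hash reference, as in the fix above, was an experimental feature that newer Perls (5.24 onwards, as far as I know) no longer accept, which is another reason to avoid that form.

```perl
use strict;
use warnings;

# Return the input lines with later duplicates dropped; first-seen order is kept.
sub unique_in_order {
    my @lines = @_;
    my %seen;
    # $seen{$_}++ is 0 (false) the first time a line is met, so !0 keeps it;
    # every later copy sees a non-zero count and is filtered out.
    return grep { !$seen{$_}++ } @lines;
}

# Hypothetical sample data: three link lines, one of them repeated.
my @links = (
    "<a href=\"https://b.example\">B</a>\n",
    "<a href=\"https://a.example\">A</a>\n",
    "<a href=\"https://b.example\">B</a>\n",
);

print unique_in_order(@links);
```

The output keeps B before A, i.e. the order in which the links first appeared in the post.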

Now, the approach in this post from Removing repeated lines from file:

my %read_lines=();
while(defined($_=<FILE>)){
    if(!defined($read_lines{$_})){
        print OUTFILE $_;
        $read_lines{$_}=1;
    }
}

actually seems to work, although it does create an additional file. A similar but shorter example is from this post from Remove Duplicate Lines:

my %lines; 
#open DATA, $ARGV[0] or die "Couldn't open $ARGV[0]: $!\n"; 
while (<DATA>) { print if not $lines{$_}++; }

However, it should be possible to avoid the additional file by first reading the file into an array,

my @lines = <FILE>;

then re-opening the same file to overwrite and then replacing the while

while(defined($_=<FILE>)){

with foreach (in this case the $_ comes from the foreach), see Perl array printing: How do I print the entire contents of an array with Perl?

foreach(@lines){

like so

open FILE, $filename or die "Can't open $filename\n"; 
my @lines = <FILE>; 
close FILE;
open (FILE, ">$filename") or die "Can't open $filename\n"; 
my %read_lines=();
#while(defined($_=<FILE>)){
foreach(@lines){
    if(!defined($read_lines{$_})){
        print FILE $_;
        $read_lines{$_}=1;
    }
}
close (FILE); 

Et voilà! The file is cleared of duplicates without the need for an additional (scratch) file.

It would be possible to do the search in memory, in lieu of doing it whilst writing to a file, and write the non-duplicates to another array. However, there doesn’t seem much point.

The only problem is that it removes the <hr /> from the bottom, as there is an identical one at the top! So, I have to re-add that line (by appending it to the file) after the duplicate line removal.
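An alternative to re-appending the rule afterwards would be to exempt <hr /> lines from the duplicate check. This is only a sketch of that idea, not what ToC.pl does: unique_keep_rules() is a hypothetical name, and it assumes the rule sits on a line of its own.

```perl
use strict;
use warnings;

# Drop duplicate lines, but always keep horizontal rules so that the
# opening and closing <hr /> of a section both survive.
sub unique_keep_rules {
    my @lines = @_;
    my %seen;
    return grep { /^<hr \/>\s*$/i || !$seen{$_}++ } @lines;
}

# Hypothetical Links file: a rule at the top and bottom, one repeated link.
my @section = (
    "<hr />\n",
    "<a href=\"https://a.example\">A</a>\n",
    "<a href=\"https://a.example\">A</a>\n",
    "<hr />\n",
);

print unique_keep_rules(@section);
```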

Adding list formatting

Now the HTML list formatting needs to be added, pre- and post-fix with list item HTML tags:

# Print link contents in a list item
print $list_item_open.$link.$list_item_close."\n" if $debug > 2 || $verbose > 0;  # Debug/verbose
print OUTPUTFILE_LINKS $list_item_open.$link.$list_item_close;
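Putting the pieces together, the whole Links section might be assembled along the following lines. This is a sketch under assumptions: format_links_section() is a hypothetical helper, and the layout (a rule, an <h2> heading, an unordered list, a closing rule) is my reading of the section described in this post rather than the exact output of ToC.pl.

```perl
use strict;
use warnings;

# Build the lines of a Links section from a heading and a list of anchor tags.
sub format_links_section {
    my ($heading, @links) = @_;
    my $list_item_open  = '<li>';
    my $list_item_close = '</li>';

    my @out = ("<hr />\n", "<h2>$heading</h2>\n", "<ul>\n");
    # One flat list item per link; no nesting, to keep WordPress from
    # reformatting the list.
    push @out, $list_item_open . $_ . $list_item_close . "\n" for @links;
    push @out, "</ul>\n", "<hr />\n";
    return @out;
}

# Hypothetical sample links.
my @sample_links = (
    '<a href="https://example.com/1">one</a>',
    '<a href="https://example.com/2">two</a>',
);

print format_links_section('Links', @sample_links);
```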

Merge and remove subroutines

Now to duplicate the merge and the remove actions of the ToC script for the links file – which is a relatively straightforward copy… although remove_links() only removes the inserted “Links” section, and doesn’t need to modify the body, as remove_toc() needs to (by removing the id attributes from the HTML headings).

Issues with merge

One problem with just duplicating merge_toc() is that the order in which the ToC and the Links sections are added determines which section comes before the other, as merge simply inserts the Links or the ToC just after the <!--more--> HTML tag. Really, we want Contents to be before the Links section. So logically it would make sense to first add the links section and then add the table of contents – that way, at least an entry for the Links section will also end up in the table of contents.

Issues with remove

Similarly, another problem is that remove_links() will also remove the ToC, as they both remove the section following the <!--more--> HTML tag, if there is a <hr /> tag. What is really needed is a look-ahead function to check whether the heading after the <hr /> tag is Contents or Links. This cannot be performed by a line-by-line check. The solution would seem to be this answer to Match multiple line string in Perl.

I don’t want to be lazy and make use of a <!--Links--> or <!--Contents--> type HTML comment.

The problem is:

  1. If we wait until the heading, Links or Contents, then we have already written out the preceding <hr />
  2. When we get the first <hr />, how do we know what the next line is, whether it is a heading for Contents or Links, or something else entirely? In the latter case, we certainly don’t want to start stripping out lines, and in the first two cases, we want to be able to choose between them.
  3. We could read two or three lines to check for both <hr /> and <h2>Contents</h2> or <h2>Links</h2>, but then we have missed the chance of printing these lines out if there isn’t a match – unless we save them to an array and then print them out (in the same manner as the non-duplicate lines are printed from the array).

A messy way is to do this (note $links_heading contains the Links heading string):

open_all_files_for_remove();
my $flag_dump_links = 0;
while (<INPUTFILE_HTML>) {
    print $_ if $debug > 2 || $verbose > 0;  # Debug/Verbose
    $line = $_;
    if ($line =~ /^<!--more-->$/i) {
        # Always keep the <!--more--> tag itself (printed once only)
        print OUTPUTFILE_HTML $line;
        # Look ahead two lines for <hr /> and Links
        my $line_1 = <INPUTFILE_HTML>;
        my $line_2 = <INPUTFILE_HTML>;
        if (defined $line_2 && $line_2 =~ /^$links_heading$/i) {
            # The Links section has started
            $flag_dump_links = 1;
        } else {
            # Not a Links section, so write out the two look-ahead lines
            print OUTPUTFILE_HTML $line_1 if defined $line_1;
            print OUTPUTFILE_HTML $line_2 if defined $line_2;
        }
    } elsif ($flag_dump_links && $line =~ /^<hr \/>$/i) {
        # The Links section has ended
        $flag_dump_links = 0;
    } elsif ($flag_dump_links) {
        # Do nothing, just dump the line
    } else {
        # Print (copy) existing line into "new" HTML body
        print OUTPUTFILE_HTML $line;
    }
}
close_all_files_for_remove();

However, while this method of checking would work well for the Contents removal (by swapping $links_heading with $contents_heading), it will not work if there is also a table of contents before the Links section, which would normally be the case. It fails because the search for the Links heading is only triggered by the <!--more--> HTML tag, and the lines immediately after that tag belong to the table of contents.

So, this method works well for a check to make sure we aren’t removing the table of contents when we actually want to remove the links section, but how to check for a links section which is placed after a table of contents – and to remove the Links ToC entry from the table of contents..? Well, maybe we don’t need to perform an even more complex search than the one above.
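For completeness, the array-based look-ahead from point 3 of the list above could be sketched as follows. This is not the code in ToC.pl: remove_section() is a hypothetical helper, and it assumes that each inserted section starts with an <hr /> line directly followed by its heading line, and ends with its own closing <hr /> line.

```perl
use strict;
use warnings;

# Remove the section that starts with <hr /> followed by the given heading
# line, up to and including that section's closing <hr />.
sub remove_section {
    my ($heading, @lines) = @_;
    my @kept;
    my $i = 0;
    while ($i < @lines) {
        # Peek at the following line without consuming it.
        my $next = $i + 1 < @lines ? $lines[$i + 1] : '';
        if ($lines[$i] =~ /^<hr \/>\s*$/i && $next =~ /^\Q$heading\E\s*$/i) {
            $i += 2;    # skip the opening rule and the heading
            $i++ while $i < @lines && $lines[$i] !~ /^<hr \/>\s*$/i;
            $i++;       # skip the closing <hr /> itself
            next;
        }
        push @kept, $lines[$i++];
    }
    return @kept;
}

# Hypothetical post body: a Contents section followed by a Links section.
my @body = (
    "<!--more-->\n",
    "<hr />\n", "<h2>Contents</h2>\n", "<ul><li>Preamble</li></ul>\n", "<hr />\n",
    "<hr />\n", "<h2>Links</h2>\n", "<ul><li>a link</li></ul>\n", "<hr />\n",
    "<p>Body text</p>\n",
);

print remove_section('<h2>Links</h2>', @body);
```

Because the whole file is held in an array, nothing has been written out by the time the heading is inspected, so it no longer matters whether the section sits directly after the <!--more--> tag or after another section.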


[ As an aside: Maybe we don’t even need the above check for the specific heading..? Well, yes we do, for both remove_toc() and remove_links(), because we don’t want to inadvertently remove the contents, or some other section, when trying to remove the links section, and likewise, we don’t want to remove the links section, or some other section, when trying to remove the contents. The previous basic check for just a horizontal rule after the <!--more--> HTML tag is a little too vague and ambiguous.]


One simple solution would be not to programmatically search for Links sections after a Contents section, but rather to programmatically enforce the order in which the Contents and Links sections are added and removed. Like so:

  1. First add the links section, straight after the <!--more--> HTML tag.
  2. Then add the Contents, straight after the <!--more--> HTML tag.

This method works well, because, as stated in Issues with merge, both the ordering of the sections makes sense and there will be an entry in the ToC to the links section.

If, for some reason, you then want to remove the Links section, but still have a table of contents, you must

  1. First, remove the Contents
  2. Then remove the links section
  3. Finally, recreate and merge the Contents

This method works well, because the entry in the ToC to the links section will be removed automatically, upon the re-creation of the ToC (as the links section no longer exists in the HTML body).

I added some subroutines to check for already existing tables of contents and links sections: check_for_section(), links_already_inserted(), toc_already_inserted(), check_then_merge_links(), and check_then_merge_toc().
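The real implementations are in the script itself. Purely as an illustration of the idea, and not a copy of that code, the check could look something like the sketch below. The subroutine names are taken from the list above, but their signatures, the heading strings and the rule-then-heading layout are all assumptions on my part.

```perl
use strict;
use warnings;

# Hypothetical version: true if an <hr /> line is immediately followed by
# the given heading line anywhere in @lines.
sub check_for_section {
    my ($heading, @lines) = @_;
    for my $i (0 .. $#lines - 1) {
        return 1 if $lines[$i] =~ /^<hr \/>\s*$/i
                 && $lines[$i + 1] =~ /^\Q$heading\E\s*$/i;
    }
    return 0;
}

# Thin wrappers; the heading strings are assumed, not taken from ToC.pl.
sub links_already_inserted { return check_for_section('<h2>Links</h2>', @_) }
sub toc_already_inserted   { return check_for_section('<h2>Contents</h2>', @_) }

# Hypothetical post body that already has a table of contents but no links.
my @post = (
    "<!--more-->\n",
    "<hr />\n", "<h2>Contents</h2>\n", "<ul><li>Preamble</li></ul>\n", "<hr />\n",
    "<p>Body text</p>\n",
);

print "ToC already inserted\n"               if toc_already_inserted(@post);
print "No Links section yet, safe to merge\n" unless links_already_inserted(@post);
```

A check_then_merge_links() style wrapper would then only call the merge when the corresponding check returns false.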

All of these functions were added to version 2.9.1. See perldoc for that version.

Usage of Getopt::Declare

Usually, Getopt::Declare uses a string $specification. From https://metacpan.org/pod/Getopt::Declare

$specification = q(

-a Process all data

-b <N:n> Set mean byte length threshold to <N>
{ bytelen = $N; }

+c <FILE> Create new file <FILE>

--del Delete old file
{ delold() }

delete [ditto]

e <H:i>x<W:i> Expand image to height <H> and width <W>
{ expand($H,$W); }

-F <file>... Process named file(s)
{ defer {for (@file) {process()}} }

=getrand [<N>] Get a random number
(or, optionally, <N> of them)
{ $N = 1 unless defined $N; }

-- Traditionally indicates end of arguments
{ finish }
);

And invoked by:

use Getopt::Declare;
$args = Getopt::Declare->new($specification);

or

use Getopt::Declare $specification => $args;

or

use Getopt::Declare;
$args = Getopt::Declare->new($specification_string, $optional_source);
# or:
use Getopt::Declare $specification_string => $args;

It is not clear what $optional_source is.

A couple of different ways of declaring $specification are shown in Getopt::Declare vs Getopt::Long:

Readonly my $ARGS => Getopt::Declare->new(
  join( "\n",
    "[strict]",
    "--engineacct <num:i>\tEngineaccount [required]",
    "--outfile <outfile:of>\tOutput file [required]",
    "--clicks <N:i>\tselect keywords with more than N clicks [required]",
    "--infile <infile:if>\tInput file [required]",
    "--pretend\tThis option not yet implemented. "
    . "If specified, the script will not execute.",
    "[ mutex: --clicks --infile ]",
  )
) || exit(1);

or

use Getopt::Declare;
Getopt::Declare->new(<<"EOPARAM");
  [strict]
  --client <client:i>\tclient number [required]
  --clicks <clicks:i>\tclick threshold (must be > 5)
EOPARAM

