Reading WordPress pages in Perl

Preamble

I wanted to extend my WordPress table of contents Perl script so that it fetches the HTML of a WordPress page itself, rather than requiring the user to manually copy the HTML from the Text tab of the WordPress editor.

However, whilst researching how to add this functionality, I found that a simple get(), using LWP::Simple, doesn’t work when trying to retrieve WordPress pages. get() returns undef instead of the page, so printing the result gives this warning:

Use of uninitialized value $page in print at ./GetPage.pl line 13.

See also

Links

PerlMonks

StackExchange

Issue

This code:

#!/usr/bin/perl -w

# From https://www.perlmonks.org/bare/?node_id=276032

use strict;
use warnings;
use LWP::Simple;

my $page = get 'https://gr33nonline.wordpress.com/2020/12/23/create-an-external-links-section-for-wordpress-com-posts/';

print $page;

produces this error:

Use of uninitialized value $page in print at ./GetPage.pl line 13.

Initial attempt to fix – failure…

Initially I thought that WordPress was rejecting LWP’s default user agent. The two options were to either change the user agent:

use LWP::Simple qw($ua get);
$ua->agent('My agent/1.0');
my $url = "http://en.wikipedia.org/wiki/Hotel";
# Parentheses and "or" are needed: "get $url || die" parses as get($url || die ...)
my $html = get($url) or die "Timed out!";

Or use LWP::UserAgent

use LWP::UserAgent;
my $ua = LWP::UserAgent->new();
my $req = HTTP::Request->new( GET => 'http://en.wikipedia.org/wiki/Hotel' );
my $res = $ua->request($req);
my $content = $res->content;
#----------------------
print "$content\n";
#----------------------

So, for the first example, like this:

#!/usr/bin/perl -w

# From https://www.perlmonks.org/bare/?node_id=276032

use strict;
use warnings;

use LWP::Simple qw($ua get);
$ua->agent('My agent/1.0');

my $page = get 'https://gr33nonline.wordpress.com/2020/12/23/create-an-external-links-section-for-wordpress-com-posts/';

print $page;

However, this still gave the same error.

I didn’t bother using LWP::UserAgent.

Getting to the bottom of the error

Following the tip from this post, and using getprint() instead of get(), gives this error:

500 Can't verify SSL peers without knowing which Certificate Authorities to trust <URL:https://gr33nonline.wordpress.com/2020/12/23/create-an-external-links-section-for-wordpress-com-posts/>

So the real problem is SSL certificate verification, not the user agent: LWP does not know which Certificate Authorities to trust.

There are two easy ways to solve this, insecure and secure, as stated in this answer to Perl LWP::Simple::get($url) does not work for some urls.

Insecure fix

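Use HTTP::Tiny. This post states that HTTP::Tiny has SSL certificate checking disabled by default, which is why this counts as the insecure option.

The example in that post also loads JSON and Data::Dumper, but only to decode and dump a JSON response; HTTP::Tiny itself does not need them. HTTP::Tiny ships with Perl (core since 5.14) and so does Data::Dumper, but JSON is not part of the core, so you may not have it installed.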

This code, from this post, works. I have commented out the additional JSON and Data::Dumper modules and associated code:

#!/usr/bin/perl -w

use strict; use warnings;
use HTTP::Tiny;

#use JSON;
#use Data::Dumper;

my $url = "https://gr33nonline.wordpress.com/2020/12/23/create-an-external-links-section-for-wordpress-com-posts/";
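my $res = HTTP::Tiny->new->get( $url );

# Stop with the HTTP status if the request failed
die "Failed to fetch $url: $res->{status} $res->{reason}\n" unless $res->{success};

print $res->{'content'};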

#my $decoded_json = decode_json( $res->{'content'} );
#print Dumper( $decoded_json );

This does actually produce the web page… the whole web page. However WordPress clearly adds a lot of junk to the code. See Cleaning out the WordPress mess below.

Secure fix

The second lot of code from this post:

use strict;
use warnings;
use LWP::UserAgent;
use IO::Socket::SSL;

my $ua = LWP::UserAgent->new;
$ua->ssl_opts(
SSL_fingerprint => 'sha256$70bca153ac950b8fa92d20f04dceca929852c42dc1d51bdc3c290df256ae05d3',
SSL_ocsp_mode => SSL_OCSP_NO_STAPLE,
);
my $resp = $ua->get('https://www.cryptopia.co.nz/api/GetCurrencies');
print $resp->decoded_content;

However, the use of SSL_OCSP_NO_STAPLE gives me a syntax error. In any case, this approach would also require the user to paste in the SSL fingerprint. As the post says:

The fingerprint you see here is the one you can also see in the browser when looking at the certificate.

Which is a bit of a pain.

Another set of solutions can be found in the answers to Retrieving HTTP URLs using Perl scripting.
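
A possible middle ground, not taken from the posts above, is to keep HTTP::Tiny but ask it to verify certificates with its verify_SSL option. This is only a sketch: fetch_page is a made-up helper name, and it assumes IO::Socket::SSL and a usable CA bundle (for example via Mozilla::CA) are installed, which I have not checked on every setup.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# verify_SSL => 1 asks HTTP::Tiny to check the server certificate.
my $http = HTTP::Tiny->new( verify_SSL => 1 );

# Hypothetical helper: fetch a page and return its body,
# or die with the HTTP status if the request failed.
sub fetch_page {
    my ($url) = @_;
    my $res = $http->get($url);
    die "Failed to fetch $url: $res->{status} $res->{reason}\n"
        unless $res->{success};
    return $res->{content};
}

# Needs network access, so it only runs when a URL is given on the command line.
print fetch_page( $ARGV[0] ) if @ARGV;
```

Run it as ./GetPage.pl followed by the URL of the post. No fingerprint needs to be pasted in, because the certificate is checked against the installed CA bundle.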

Cleaning out the WordPress mess

The start of the site’s actual body is delimited by <!-- .entry-header --> followed by <div class="entry-content">.

However, the end isn’t so well defined, with the last line of the body followed by:

<div id="atatags-370373-5ffda23775bab">
<script type="text/javascript">
__ATA.cmd.push(function() {
__ATA.initVideoSlot('atatags-370373-5ffda23775bab', {
sectionId: '370373',
format: 'inread'
});
});
</script>
</div> <div id="atatags-26942-5ffda23775c0e"></div>

So there is a requirement to filter out, or trim, all of the excess WordPress baggage, before the script can work its table of contents and/or links section magic.

This is now turning into a bit of a monumental task, and not the quick five-minute hack that I had anticipated.

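The contents of $res->{'content'} need to be split into lines, using /^/, before being placed in the array. split treats the pattern /^/ as /^/m, so the string is broken at the start of every line and each element keeps its trailing newline.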
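
As a small illustration of that split, here is a sketch that uses a made-up string in place of $res->{'content'}:

```perl
use strict;
use warnings;

# Made-up stand-in for $res->{'content'}
my $content = "<p>one</p>\n<p>two</p>\nlast line, no newline";

# The pattern /^/ is treated as /^/m, so the string is cut before each line start
my @lines = split( /^/, $content );

print scalar(@lines), " lines\n";
print "[$_]" for @lines;
print "\n";
```

Because the newlines stay attached to each element, the lines can be printed straight back out without adding "\n".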

Nevertheless, this is the code to strip out the WordPress excess:

my $filename = "out.html";
my @lines;
my $flag_found_contents = 0;
@lines = split( /^/, $res->{'content'} );
open(FILE, ">", $filename ) or die "Cannot open $filename: $!";
foreach (@lines) {
  if ( $flag_found_contents && $_ =~ /<div id="atatags-/ ) { 
    $flag_found_contents = 0; # Found end of content
    # Set this *before* the printing if, so that the *current* line *is not* printed
  }
  if ($flag_found_contents) {
    print FILE $_;
  } 
  if ( $_ =~ /<div class="entry-content">/ ) {
    $flag_found_contents = 1; # Found start of content
    # Set this *after* the printing if, so that the *next* line *is* printed
  }
}
close(FILE);

Notes

Once the WordPress junk has been cleared out, there are still some differences between the source HTML (from the Text tab) and the downloaded HTML:

  • The more tag
  • First line indentation
  • Paragraph tags
  • Missing blank lines associated with the paragraph tags
  • Some characters replaced by HTML (unicode) code
  • List items lose indentation

The more tag

The <!-- more--> gets replaced by

<p><span id="more-24176"></span></p>

To fix this, replace

print FILE $_;

with

if ($_ =~ /<p><span id="more-.*"><\/span><\/p>/){
  print FILE "<!-- more-->\n";
} else{
  print FILE $_;
}

Note the greedy match (.*) to take out the number, which changes from post to post.
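
A slightly tighter variant would be to match digits only instead of (.*). This is a sketch that assumes the id is always "more-" followed by a number, as in the one example above; restore_more_tag is just an illustrative name.

```perl
use strict;
use warnings;

# Turn the rendered "more" span back into the tag used in this post.
# Assumption: the id is always "more-" followed by digits only.
sub restore_more_tag {
    my ($line) = @_;
    $line =~ s{<p><span id="more-\d+"></span></p>}{<!-- more-->};
    return $line;
}

print restore_more_tag(qq{<p><span id="more-24176"></span></p>\n});
print restore_more_tag(qq{<p>Ordinary paragraph</p>\n});
```

Using \d+ means a line is only rewritten when it really looks like the generated span, and any other line passes through unchanged.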

First line indentation

The first line is indented by two 8-space tabs, i.e. sixteen space characters: the indentation is two tabs in $res->{'content'}, which become 16 spaces when written to the output file.

This requires an additional flag, which is set when we first detect the start of the contents. On the next line we check whether the flag is set and, if so, strip out the leading whitespace and reset the flag (to zero).

if ($flag_found_contents) {
  if ($flag_first_line){
    $flag_first_line = 0;
    $_ =~ s/^\s*(.*)/$1/; # Strip out the indentation
  }
...

If the user’s own content uses atatags-

If a div with an id attribute beginning with “atatags-” appears in the body of the post, then the script will treat it as the end of the content and stop writing the file prematurely.
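
One way to reduce that risk might be a stricter end-marker test. The sketch below assumes that the generated markers always look like the single sample seen earlier (atatags-, then digits, a hyphen, then hex digits, alone on the line). I have only that one sample to go on, so treat the pattern as a guess; is_end_marker is a made-up name.

```perl
use strict;
use warnings;

# True only for lines shaped like the generated marker seen earlier:
# atatags-<digits>-<hex digits>, alone on its line.
sub is_end_marker {
    my ($line) = @_;
    return $line =~ /^\s*<div id="atatags-\d+-[0-9a-f]+">\s*$/ ? 1 : 0;
}

print is_end_marker(qq{<div id="atatags-370373-5ffda23775bab">\n}), "\n";
print is_end_marker(qq{<div id="atatags-my-own-div">text</div>\n}), "\n";
```

The end-of-content test in the loop could then call is_end_marker($_) instead of matching /<div id="atatags-/ directly.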

The complete WordPress excess stripping code

This can be tacked on to the end of the insecure code shown above (which uses HTTP::Tiny):

my $filename = "out.html";
my @lines;
my $flag_found_contents = 0;
my $flag_first_line = 0;

@lines = split( /^/, $res->{'content'} );
open(FILE, ">", $filename ) or die "Cannot open $filename: $!";
foreach (@lines) {
  if ( $flag_found_contents && $_ =~ /<div id="atatags-/ ) { 
    $flag_found_contents = 0; # Found end of content
    # Set this *before* the printing if, so that the *current* line *is not* printed
  }
  if ($flag_found_contents) {
    if ($flag_first_line){ 
      $flag_first_line = 0; 
      $_ =~ s/^\s*(.*)/$1/; # Strip out the indentation 
    }     
    if ($_ =~ /<p><span id="more-.*"><\/span><\/p>/){ 
      print FILE "<!-- more-->\n"; 
    } else { 
      print FILE $_; 
    }  
  } 
  if ( $_ =~ /<div class="entry-content">/ ) {
    $flag_found_contents = 1; # Found start of content
    # Set this *after* the printing if, so that the *next* line *is*  printed
    $flag_first_line = 1; 
  }
}
close(FILE);
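
To try the flag logic without fetching anything, the same steps can be wrapped in a function that works on a string. This is a sketch: extract_entry_content and the sample page are made up for illustration, with the sample shaped like the markers described earlier.

```perl
use strict;
use warnings;

# Return only the lines between the entry-content div and the first atatags div.
sub extract_entry_content {
    my ($html) = @_;
    my $in_content = 0;
    my $first_line = 0;
    my $out        = '';
    foreach my $line ( split /^/, $html ) {
        # End of content: clear the flag first so this line is not kept
        $in_content = 0 if $in_content && $line =~ /<div id="atatags-/;
        if ($in_content) {
            if ($first_line) {
                $first_line = 0;
                $line =~ s/^\s+//;    # strip the indentation of the first line
            }
            if ( $line =~ /<p><span id="more-.*"><\/span><\/p>/ ) {
                $out .= "<!-- more-->\n";
            } else {
                $out .= $line;
            }
        }
        # Start of content: set the flags last so the *next* line is kept
        if ( $line =~ /<div class="entry-content">/ ) {
            $in_content = 1;
            $first_line = 1;
        }
    }
    return $out;
}

# Made-up sample page
my $sample = join '',
    "<header>junk</header>\n",
    "<div class=\"entry-content\">\n",
    "\t\t<p>First paragraph</p>\n",
    "<p><span id=\"more-24176\"></span></p>\n",
    "<p>Second paragraph</p>\n",
    "<div id=\"atatags-370373-5ffda23775bab\">\n",
    "<p>advert</p>\n";

print extract_entry_content($sample);
```

Returning a string rather than writing to a file makes it easy to compare the result against what the Text tab shows.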

Further comparison of the two HTML pages

At this point, the HTML is more or less the same, and could probably be pasted back into WordPress without any detrimental effects. In addition, the WP editor will fix any remaining issues when the user switches back and forth between the Visual and Text tabs in the editor: the <p></p> tags are removed (I NEED TO CHECK THIS), the <li></li> indentation is restored (verified), and the character codes are automagically fixed (verified). Note: I am using the old style editor with the following URL:

https://<your_site>.wordpress.com/wp-admin/post.php?post=<post#>&action=edit&classic-editor

Chasing down the differences now starts to have diminishing returns.

Paragraph tags

If you now compare the HTML from the Text tab, of the WordPress editor, with that downloaded/generated by the script, one major difference is seen. The HTML from the Text tab does not contain any <p>...</p> tags, whereas the downloaded HTML does. It seems to be the only HTML tag that is omitted (in the original).

That is rather annoying, as I was hoping that they would be facsimiles of each other.

As this downloaded/generated HTML code is to eventually be pasted back into the Text tab (after processing for the table of contents and/or the links section), then the solution would seem to be to strip out the <p></p> tags.

Replace

print FILE $_;

with

$_ =~ s/<p>(.*)<\/p>/$1/; # Strip out the paragraph HTML tags
print FILE $_;

Blank lines

Once the paragraph tags have been removed, the issue now is the missing blank lines in the downloaded/generated HTML. In the original, a blank line appears after each paragraph, unless the following line starts with an HTML tag (such as <pre> or ?).

This would seem to require another flag and some post-processing of the previous line: if there are paragraph tags, then remove them, and set the paragraph found flag. Then, for the following line, if it does not start with an HTML tag, print a blank line before printing the current line. In either case, reset the flag.

So, adding a conditional and a flag setting to the <p></p> tag stripping code above:

if ( $_ =~ /^<p>/ ) {
  $flag_paragraph_found = 1;
}
$_ =~ s/<p>(.*)<\/p>/$1/; # Strip out the paragraph HTML tags
print FILE $_;

and then, at the start of the next iteration (so before the code above runs again), both resetting the flag and putting in a blank line when needed:

if ($flag_paragraph_found) {
  $flag_paragraph_found = 0;
  if ( $_ =~ /^</ && $_ !~ /^<p>/ ) {
    # do nothing
  } else {
    print FILE "\n";
  }
}
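
Put together in loop order, the two fragments might look like the sketch below. It works on a made-up list of lines and collects the output in a string; strip_paragraphs is an illustrative name, and $flag_paragraph_found is declared here because the fragments above do not show its declaration.

```perl
use strict;
use warnings;

# Strip <p>...</p> and re-insert the blank line that separates paragraphs.
sub strip_paragraphs {
    my @lines = @_;
    my $flag_paragraph_found = 0;
    my $out = '';
    foreach my $line (@lines) {
        # 1. The previous line was a paragraph: decide whether a blank line is due
        if ($flag_paragraph_found) {
            $flag_paragraph_found = 0;
            if ( $line =~ /^</ && $line !~ /^<p>/ ) {
                # next line starts with some other tag: no blank line
            } else {
                $out .= "\n";
            }
        }
        # 2. Remember whether this line is a paragraph, then strip its tags
        $flag_paragraph_found = 1 if $line =~ /^<p>/;
        $line =~ s/<p>(.*)<\/p>/$1/;
        $out .= $line;
    }
    return $out;
}

print strip_paragraphs(
    "<p>One</p>\n", "<p>Two</p>\n", "<pre>code</pre>\n", "<p>Three</p>\n"
);
```

The order matters: the blank-line check has to look at the current line while it still has its <p> tag, so it runs before the stripping.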

Characters replaced by HTML (unicode) code

Some characters are replaced by their numeric HTML (Unicode) character codes. Here are some examples:

  • ' replaced by ’ (&#8217;)
  • - replaced by – (&#8211;)
  • straight double quotes replaced by smart quotes “ and ” (&#8220; and &#8221;)
  • and so on.

From this answer to the question How can I decode HTML entities?, there is a code example, although this doesn’t fix the unicode, and you get some odd results:

use HTML::Entities;
my $html = "Snoopy &amp; Charlie Brown";
print decode_entities($html), "\n";

The other modules that could be used, Text::Unidecode (from this answer) and Text::Iconv (from this answer), need to be installed from CPAN. From this answer:

use Text::Unidecode qw(unidecode);
use HTML::Entities qw(decode_entities);

my $source = '&#21271;&#20144;';
print unidecode(decode_entities($source));

# That prints: Bei Jing 

As this answer states, converting unicode to ASCII is also nigh impossible – see joelonsoftware.com/articles/Unicode.html. So, seeing as WordPress automagically fixes these unicodes anyway, once the HTML is pasted back into the Text tab of the editor, and the user switches between the Text and Visual tabs a couple of times, there is really no need for the script to fix these. After all, why do more work than you need to?
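
That said, if the script ever did need to undo just the handful of substitutions listed earlier, a core-only sketch could look like this. It assumes the downloaded HTML carries those characters as numeric entities (&#8217;, &#8211;, &#8220;, &#8221;); plain_punctuation and the sample string are made up, and any other entity is deliberately left alone.

```perl
use strict;
use warnings;

# Map a few typographic entities back to plain ASCII.
# Only these four code points are handled; anything else is left untouched.
my %ascii_for = (
    8217 => "'",    # right single quote
    8211 => "-",    # en dash
    8220 => '"',    # left double quote
    8221 => '"',    # right double quote
);

sub plain_punctuation {
    my ($text) = @_;
    $text =~ s/&#(\d+);/exists $ascii_for{$1} ? $ascii_for{$1} : "&#$1;"/ge;
    return $text;
}

print plain_punctuation("It&#8217;s a &#8220;test&#8221; &#8211; &#169; 2021\n");
```

This is a lookup table rather than a general Unicode to ASCII conversion, so it sidesteps the problem described above instead of solving it.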

List items lose indentation

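The HTML list items lose their indentation in the downloaded HTML. As noted above, the WordPress editor restores the <li></li> indentation when the user switches between the Visual and Text tabs, so the script does not need to handle this.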
