Create table of contents for WordPress.com posts

Preamble

Whilst writing particularly long posts I realised that a table of contents (ToC) would be most useful, based on the headings in the post.

It wouldn’t be that hard to do, in Perl.

Code on GitLab.

See also

Dev stages

  1. Pull out headings to create unlinked ToC – Strip contents of heading tags
  2. Indent ToC entry relative to the heading level, i.e. H1, H2, H3, etc.
  3. Modify existing text, insert id in heading HTML tags. Number by heading_level.occurrence_number, i.e. if the first heading is an H1 => id = “1.1”, the second heading is an H2 => id = “2.1”, the third heading is an H1 => id = “1.2”. Use heading index array to keep track of next id tag number.
  4. Add link to headings in ToC
  5. Output both the ToC and the modified HTML source to two separate (temporary?) files.
  6. Merge these two files into one file.
  7. Write a ToC removal script – to revert the changes (remove the ToC and strip out the inserted id HTML elements).
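
Stages 1 to 4 can be sketched as a single pass over the page. This is only an illustration under assumptions: the sub name build_toc and the sample lines are made up, headings are assumed to sit on a line of their own as plain <hN> tags without attributes, and the results are kept in memory rather than written to files.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the lines of a page and returns two array
# references, the linked ToC lines and the page with id attributes added.
sub build_toc {
    my @lines = @_;
    my (@toc, @html, @heading_index);

    for my $line (@lines) {
        # Only plain heading tags at the start of a line are handled here
        if ($line =~ /^<h(\d)>(.*?)<\/h\1>/i) {
            my ($level, $content) = ($1, $2);

            # id = heading_level.occurrence_number
            my $id = $level . "." . ++$heading_index[$level];

            # ToC entry: one space of indent per heading level, linked to the id
            push @toc, (" " x $level) . "<a href=\"#$id\">$content</a>";

            # Insert the id into the heading tag of the page copy
            $line =~ s/^<h(\d)>/<h$1 id="$id">/i;
        }
        push @html, $line;
    }
    return (\@toc, \@html);
}

my ($toc, $html) = build_toc(
    "<h1>Intro</h1>", "Some text", "<h2>Details</h2>", "<h1>Summary</h1>",
);
print "$_\n" for @$toc, @$html;
```

The two returned lists correspond to the two separate files of stage 5, so merging (stage 6) stays a separate step.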

Foreseen issues

  • Unstructured (i.e. non-sequential) headings (e.g. H2 followed immediately by H4, instead of H3 and then H4) may give weird indentation in ToC
  • Re-running the script on a page which already has a ToC inserted. Existing id tags will need to be detected/acknowledged and modified, rather than just added. The existing contents section will have to be ignored, or stripped out and re-written, or modified. Use a separate remove_existing_ToC script that also rips out any id tags of the form “X.X+”. This restores the HTML page to how it was before the script was ever run. Delimit the ToC within <hr> tags – be aware of the <hr /> auto-replace.
  • Adding links to headings in ToC
  • Inserting id tags.
  • May need to replace indenting whitespaces with &nbsp;. However, these get automatically stripped out by WP, when switching from Text to Visual mode and back. So indentation is lost.
  • Could use nested lists – however, as shown in Getting a nest list correctly – in WP Classic Editor the formatting can be automatically lost.
  • <hr> automatically becomes <hr /> in WP.
  • <br> automatically removed by WP – check this, maybe need <br />???
  • WP uses lower case HTML tags
  • When writing out the lines for the contents, \n is required. When writing out the existing HTML source, no \n is required as there is already one present at the end of the line.
  • If heading tag contains any other information, i.e. a class element, then this may be lost
  • If heading tag already contains an id element, then this may be overwritten/lost – notify user
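
The last two issues could be caught before anything is modified. The following is a minimal sketch, not the script's actual behaviour: check_heading_attributes is a hypothetical name, and it only looks at opening heading tags at the start of a line.

```perl
use strict;
use warnings;

# Hypothetical check: return a warning for every heading whose opening tag
# already carries attributes (such as class or id) that an insert could clobber.
sub check_heading_attributes {
    my @lines = @_;
    my @warnings;
    my $line_number = 0;

    for my $line (@lines) {
        $line_number++;
        # Opening heading tag followed by whitespace and at least one attribute
        next unless $line =~ /^<h(\d)\s+([^>]+)>/i;
        my ($level, $attributes) = ($1, $2);

        my $kind = $attributes =~ /\bid\s*=/i ? "an existing id" : "other attributes";
        push @warnings, "line $line_number: h$level has $kind ($attributes)";
    }
    return @warnings;
}

my @warnings = check_heading_attributes(
    '<h2>Plain</h2>',
    '<h2 class="fancy">Styled</h2>',
    '<h3 id="my-anchor">Anchored</h3>',
);
print "$_\n" for @warnings;
```

Printing the warnings and leaving the decision to the user matches the "notify user" idea above.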

Other thoughts for implementation of strip out

  • Use HTML comments to delimit the contents? <!-- start/end ToC -->?
  • Use shared variables for inserted strings so that they can be removed easily – shared code between insert and strip scripts. Everything that is added, should be in a (constant) string and not hard coded. See section on Constant Strings below.
  • To avoid stripping out user added id tags, use a unique identifier, like ToC_Heading_? Such as id="ToC_Heading_2.3". This unique identifier is contained in $heading_identifier.
  • When re-running, run strip out first? This saves the code from having to parse for, and skip, the existing ToC as well as the already inserted id tag elements. Running the strip out first, on a re-run, will make the code a lot simpler.
  • Simple arguments or GetOpts? Could have default arguments like “s s”. Arguments are three filenames, insert, remove, verbose, debug, version, usage, help.

Process

Some manual pre- and post processing is required, by the user, in steps 1-2 and 8:

  1. Manually copy your entire page from the Text tab
  2. Paste in to a text file
  3. Run a parsing script on the file
  4. Script produces a Table of contents
  5. Script produces a copy of the page but with HTML5 id tags inserted
  6. Create the links in the TOC
  7. Merge the TOC into the top of the page (probably just after the Read More tag).
  8. Paste the merged text into the Text tab of the WordPress.com post page

Pseudo code

Extracting the headings

Using Perl or a script, the process for finding the headings and their respective levels would be as follows:

foreach my $line (@lines) {   # @lines holds the lines of the pasted page
    # Only handle lines that start with "<H" or "<h"
    next unless $line =~ /^<h\d>(.+?)</i;
    # Get the heading content
    my $line_content = $1;

    # For the heading level, indent the new ToC line accordingly
    $line =~ /^<h(\d)/i;
    my $indent_value = $1;
    my $spaces = indentLine($indent_value);        # returns $indent_value spaces
    # Prepend the spaces to $line_content and append the entry to the ToC file
    print OUTPUTFILE $spaces.$line_content."\n";
}

Combining the two matches, as they are essentially the same:

for my $line (@lines) {
  # If line starts with "<H" or "<h"
  if ($line =~ /^<h(\d)>(.+?)</i) {
    # Get the indent value
    my $indent_value = $1;
    # Get the heading content
    my $line_content = $2;

    # For the heading level, indent the new ToC line accordingly
    my $spaces = indentLine($indent_value);      # returns $indent_value spaces
    # Prepend the spaces to $line_content and append the entry to the ToC file
    print OUTPUTFILE $spaces.$line_content."\n";
  }
}
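
The indentLine() helper is not shown in this post. One possible implementation, purely as an assumption, uses the repetition operator; the optional second argument (the indent unit) is made up for this sketch so that a different unit such as &nbsp; can be tried.

```perl
use strict;
use warnings;

# Assumed implementation of indentLine(): returns $indent_value copies of
# the indent unit (a single space unless another unit is passed in).
sub indentLine {
    my ($indent_value, $unit) = @_;
    $unit = " " unless defined $unit;
    return $unit x $indent_value;
}

my $spaces = indentLine(3);
my $entry  = $spaces . "Creating the id reference";
print "[$entry]\n";
```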

Creating the id reference

Setting up the id reference, using the heading value ($indent_value) and an incremented array element. Each time a heading of a particular level is encountered, then the value of the array with a corresponding index (i.e. the index corresponding to the heading level) is incremented

# Create heading id reference
$heading_index[$indent_value]++;
$heading_occurrence = $heading_index[$indent_value];
$heading_id = $indent_value.".".$heading_occurrence;

A unique identifier prefix can also be added, depending on whether a flag is set or not

if($useUniqueHeadingIndentifier){
  $heading_id = $heading_identifier.$indent_value.".".$heading_occurrence;
}
else {
  $heading_id = $indent_value.".".$heading_occurrence;
}
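
To see how the counters behave over a sequence of headings, the snippet can be wrapped in a small sub. This is only a sketch: next_heading_id and its second parameter are hypothetical names standing in for the flag used above.

```perl
use strict;
use warnings;

my $heading_identifier = "ToC_Heading_";
my @heading_index;

# Hypothetical wrapper around the id code: returns the next id for a level
sub next_heading_id {
    my ($indent_value, $use_unique_identifier) = @_;
    $heading_index[$indent_value]++;
    my $heading_id = $indent_value . "." . $heading_index[$indent_value];
    return $use_unique_identifier ? $heading_identifier . $heading_id : $heading_id;
}

# Heading levels in document order: H1, H2, H1, H2
my @ids = map { next_heading_id($_, 0) } (1, 2, 1, 2);
print join(", ", @ids), "\n";    # prints "1.1, 2.1, 1.2, 2.2"

# With the unique identifier prefix switched on
my $prefixed = next_heading_id(3, 1);
print "$prefixed\n";             # prints "ToC_Heading_3.1"
```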

Making the contents items links

Wrapping the table of contents items in anchor links, which reference the id, is simple enough – pre- and post-pending the heading content with the HTML anchor tag

print OUTPUTFILE $line_indent."<A HREF=\"#".$heading_id."\">".$line_content."</A>"."\n";

Creating the id element

Creating the id element – pre- and post-pending the heading id reference with the HTML id tag

# Create heading id tag using the id reference
my $id_element = "id=\"".$heading_id."\"";

Inserting the id tag into the headings

The regular expression for the insertion of the id tag element in the heading

s/^<h(\d)>/<h$1\ $id_element>/

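Note: I considered using \s instead of the escaped space, but \s is a character class that only has a meaning on the pattern side of the substitution; in the replacement it does not produce a space.

Note also: the backslash before the space is not needed either. The replacement is interpolated like a double-quoted string, so a plain space will do (see [oak perl] Backslash Space in Regex?):

s/^<h(\d)>/<h$1 $id_element>/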

Using this regular expression to insert the id element into the heading tag

 my $line_plus_id = $line;
 $line_plus_id =~ s/^<h(\d)>/<h$1\ $id_element>/;
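
As a quick check of the substitution, here it is applied to a heading line and a non-heading line. The sub name add_id_to_heading and the sample lines are assumptions for this sketch only.

```perl
use strict;
use warnings;

# Hypothetical wrapper: returns a copy of $line with the id element inserted
# into a plain heading tag; any other line is returned unchanged.
sub add_id_to_heading {
    my ($line, $heading_id) = @_;
    my $id_element = "id=\"" . $heading_id . "\"";
    (my $line_plus_id = $line) =~ s/^<h(\d)>/<h$1 $id_element>/;
    return $line_plus_id;
}

print add_id_to_heading("<h2>Dev stages</h2>", "2.1"), "\n";
print add_id_to_heading("<p>Not a heading</p>", "2.2"), "\n";
```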

Currently the ToC is written to one file and the source HTML is re-written (with the modified headings now including the id tags) to another file. The ToC is then merged with the modified HTML (either by manually pasting or a currently unimplemented script merge).

Removal of the ToC

Removal of the table of contents

This is done by checking for the sequence of first line <hr /> followed by <h2>Contents</h2>, then stripping out all lines until the subsequent <hr />.

Note: If this order has been modified by the user, then the script will not work.
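
The check described above might look something like the following. It is a sketch under assumptions: remove_toc is a hypothetical name, the lines are assumed to be chomped already, and the ToC is assumed to sit exactly between two <hr /> lines with the Contents heading directly after the first.

```perl
use strict;
use warnings;

my $horizontal_rule  = "<hr />";
my $contents_heading = "<h2>Contents</h2>";

# Hypothetical sketch: drop everything from the opening <hr /> (when it is
# directly followed by the Contents heading) up to and including the next <hr />.
sub remove_toc {
    my @lines = @_;
    my @kept;
    my $in_toc = 0;

    for my $i (0 .. $#lines) {
        my $line = $lines[$i];
        if ($in_toc) {
            # The closing rule ends the ToC and is itself dropped
            $in_toc = 0 if $line eq $horizontal_rule;
            next;
        }
        if (   $line eq $horizontal_rule
            && $i < $#lines
            && $lines[$i + 1] eq $contents_heading) {
            $in_toc = 1;    # start skipping, including this opening rule
            next;
        }
        push @kept, $line;
    }
    return @kept;
}

my @page = ("<hr />", "<h2>Contents</h2>", " <a href=\"#1.1\">Intro</a>", "<hr />",
            "<h1 id=\"1.1\">Intro</h1>", "Body text");
my @stripped = remove_toc(@page);
print "$_\n" for @stripped;
```

A horizontal rule that is not followed by the Contents heading is left alone, so rules added by the user elsewhere in the post survive.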

Removal of id elements from headings

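To remove the inserted id elements from the headings, use the following regular expression, where [^>]* keeps the match inside the opening tag so that the heading text is left untouched

s/^<h(\d)[^>]*>/<h$1>/

or, more accurately, using the $heading_identifier variable (quoted with \Q...\E), to avoid removing user added tags

s/^<h(\d)[^>]*\Q$heading_identifier\E[^>]*>/<h$1>/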

Note: If the script added identifiers have been modified at all then the script will probably not work.
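
A small sketch of the identifier-based removal, showing that a user added id survives. The sub name remove_inserted_id and the sample headings are made up; note that the whole opening tag is rewritten, so any other attributes in a matched tag are dropped as well.

```perl
use strict;
use warnings;

my $heading_identifier = "ToC_Heading_";

# Hypothetical wrapper: strip the attributes from a heading tag only when
# the tag contains the script's own identifier.
sub remove_inserted_id {
    my ($line) = @_;
    # \Q...\E quotes the identifier; [^>]* stays inside the opening tag
    $line =~ s/^<h(\d)[^>]*\Q$heading_identifier\E[^>]*>/<h$1>/;
    return $line;
}

print remove_inserted_id('<h2 id="ToC_Heading_2.1">Details</h2>'), "\n";
print remove_inserted_id('<h2 id="my-anchor">Mine</h2>'), "\n";
```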

Constant strings

These are used to avoid hardcoded HTML tags in print statements:

# Constant string for unique heading id identifier # TODO unique id variable set but not used yet
my $heading_identifier = "ToC_Heading_";

# Constant strings for HTML anchor wrap
my $anchorwrap_open_start="<A HREF=\"#";
my $anchorwrap_open_end = "\">";
my $anchorwrap_close = "</A>";

# Constant strings for HTML id wrap
my $idtagwrap_start = "id=\"";
my $idtagwrap_end = "\"";

# Constant strings for Contents HTML tags
my $horizontal_rule = "<hr \/>\n";
my $line_break = "<br />\n";
my $contents_heading = "<h2>Contents<\/h2>\n"; 
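
One way the same constants could serve both the insert and the strip scripts is to build the removal pattern from them with \Q...\E. This is a sketch, not the script's code: wrap_in_anchor, unwrap_anchor and $anchor_regex are hypothetical names.

```perl
use strict;
use warnings;

my $anchorwrap_open_start = "<A HREF=\"#";
my $anchorwrap_open_end   = "\">";
my $anchorwrap_close      = "</A>";

# Insert side: wrap a ToC entry in an anchor built from the constants
sub wrap_in_anchor {
    my ($heading_id, $line_content) = @_;
    return $anchorwrap_open_start . $heading_id . $anchorwrap_open_end
         . $line_content . $anchorwrap_close;
}

# Remove side: build the matching pattern from the same constants
my $anchor_regex = qr/\Q$anchorwrap_open_start\E(.*?)\Q$anchorwrap_open_end\E(.*?)\Q$anchorwrap_close\E/;

# Returns (heading id, entry text) for a wrapped line, or an empty list
sub unwrap_anchor {
    my ($line) = @_;
    return $line =~ $anchor_regex ? ($1, $2) : ();
}

my $wrapped = wrap_in_anchor("2.1", "Details");
my ($id, $text) = unwrap_anchor($wrapped);
print "$wrapped\n$id $text\n";
```

If a constant changes, the removal pattern changes with it, which is the point of not hard coding the strings.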


Unresolved issues

These are WordPress.com related, when switching between the Text and Visual tabs in the wp-admin classic editor:

  • indenting &nbsp; removed automatically
  • line breaks <br> or <br /> are removed automatically

No real solution, especially if using WordPress.com (see 10 Things I Hate About WordPress (And How to Fix Them) and How To Stop WordPress From Removing Your Paragraph Tags and Line Breaks).

Merge

How to insert the content of a file into another file before/after a pattern match? I found this code didn’t work too well, but it gave me inspiration for my implementation
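
A minimal sketch of such a merge, working on lists of lines rather than files. It is not the implementation referred to above: merge_toc is a hypothetical name, and using the WordPress Read More tag (<!--more-->) as the marker is an assumption based on the Process section.

```perl
use strict;
use warnings;

# Hypothetical merge: insert the ToC lines directly after the first line
# matching $marker; if the marker is absent, put the ToC at the top.
sub merge_toc {
    my ($marker, $toc, $html) = @_;
    my @merged;
    my $inserted = 0;

    for my $line (@$html) {
        push @merged, $line;
        if (!$inserted && $line =~ $marker) {
            push @merged, @$toc;
            $inserted = 1;
        }
    }
    unshift @merged, @$toc unless $inserted;
    return @merged;
}

my @toc  = ("<hr />", "<h2>Contents</h2>", "<hr />");
my @html = ("Intro paragraph", "<!--more-->", "<h1 id=\"1.1\">First</h1>");
my @merged = merge_toc(qr/<!--more-->/, \@toc, \@html);
print "$_\n" for @merged;
```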

Version History

Mission completed in version 0.1!

However there are some issues:

  • Confusing amount of filename variables, and default names – consolidate? Or is it more readable to have specific filename variables for each function, even if some are essentially the same, i.e. insert_html_output is the same as merge_html_input, insert_toc_output is the same as merge_toc_input. If overwriting, then merge_html_input is the same as merge_html_output.
  • Consolidate insert and merge, then insert_html_input is the same as insert/merge_html_output.
  • NOTE: However, you can’t read and write the same file, so you have to do a move afterwards, so they aren’t really the same files after all (at least not until all actions have been completed).
  • Confusing arguments passed, not consistent, default action is insert – list the various modes and associated filenames arguments. Having a default action is confusing matters as the action option removes the first filename argument space
  • Merge should have three filenames, not necessary, but should be able to specify the output filename (unless you want to overwrite the html input file)
  • The removal is rather complex (and unreadable) when removing both the id elements and the table of contents at the same time. It could be better to first remove one and then the other, i.e. first the ToC and then the id elements – this order removes the ugly check for contents heading.
  • Need to decide whether we want a plethora of files as output, or to start overwriting them and deleting the “temp” files. Use a switch? – list the various files required and output files created:
  • Inconsistent variable naming (camelCase and underscores)
  • Use of unique id reference is not switchable with passed argument
  • Debugging mode
  • Verbose mode
  • Use of Const (Strings) – prefix the variable names?
  • Use of Flag (Booleans) – prefix the variable names?
  • Removal makes no use of the constant strings which were used for insert/create
  • Next version will need to use GetOpts [Ed. – actually not implemented until v0.2.7]
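
For the note about not being able to read and write the same file, one common approach is to write to a temporary file and move it over the original once everything has been written. The sketch below uses the core File::Temp and File::Copy modules; rewrite_file and its callback parameter are hypothetical names, not part of the script.

```perl
use strict;
use warnings;
use File::Copy qw(move);
use File::Temp qw(tempfile);

# Hypothetical helper: rewrite $filename "in place" by writing the changed
# lines to a temporary file and moving it over the original afterwards.
sub rewrite_file {
    my ($filename, $transform) = @_;

    open(my $in, '<', $filename) or die "Cannot read $filename: $!";
    my ($out, $temp_name) = tempfile();

    while (my $line = <$in>) {
        # $transform receives each line (newline included) and returns the new text
        print {$out} $transform->($line);
    }
    close($in);
    close($out) or die "Cannot close $temp_name: $!";

    move($temp_name, $filename) or die "Cannot move $temp_name to $filename: $!";
    return;
}
```

With something like this, the input and output filenames can be the same from the caller's point of view, which would reduce the number of filename variables.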

Version 0.2

  • insert_toc() renamed to create_toc()
  • Basically there are four actions: create, merge, and remove – with the fourth, insert, being a combination of create and merge. The function create could be split into two further sub functions: generate (ToC) and link. The function generate could then be used to just create an outline of the text body. However, currently, there is no link capability to link an existing unlinked contents (i.e. outline), so link (or, rather, “generate-and-link”) is essentially the same as create. Whilst it would be an interesting academic exercise to write the code to link a generated unlinked outline, it is rather pointless, as it will merely duplicate the function of the “generate-and-link” code. That is to say that link (on its own) serves no purpose without generate.
  • The action re-run (or re-insert) is a combination of remove, create and merge.
  • However, the arguments and their handling are difficult due to the lack of consistency and the use of a default for create. It will be much easier if there is no default and option c is used for create. So, four arguments are used: the option and the three filenames (two filenames for remove, the third name can be ignored)

Version 0.2.5

  • Now there is a consistent number of arguments – default action removed. Only g and r require two files, the rest require three.
  • generate_toc() moved to separate function, as it required fewer files/filehandles. Output to out_html not required. Now generate flag redundant, as is most of code from create_toc().
  • clean_all_files() added, to delete output files.
  • run_test.sh created for testing the various command line arguments. This was then put into a Perl subroutine in the script that calls the script itself, repeatedly (once for each argument), self_test().
  • pod used
  • Uses pod $VERSION
  • Added open_file() and close_file(). See Send file handle as argument in perl for passing filehandle to subroutine

Shell script: run_test.sh

#!/bin/sh

# Run the script once with the given option and filenames
run_one_script()
{
  OPTION=$1
  shift
  # Having shifted once, the rest are the filenames
  ./ToC_v0.2.5.pl "$OPTION" "$@"
}

run_each_script()
{
  set -x

  ./ToC_v0.2.5.pl c s s s
  ./ToC_v0.2.5.pl m s s s
  ./ToC_v0.2.5.pl r s s
  ./ToC_v0.2.5.pl g s s

  set +x
}

run_each_script

The above script implemented as a perl subroutine

sub self_test(){
  my $filename = $0;
  print (__FILE__);
  print ("Calling: $0\n");
  system ($0); # usage
  print ("Calling: $0 h\n");
  system ("$0 h"); # help
  print ("Calling: $0 g s s\n");
  system ("$0 g s s"); # generate
  print ("Calling: $0 c s s s\n");
  system ("$0 c s s s"); 
  print ("Calling: $0 m s s s\n");
  system ("$0 m s s s"); 
  print ("Calling: $0 i s s s\n");
  system ("$0 i s s s"); 
  print ("Calling: $0 a s s s\n");
  system ("$0 a s s s"); 
  print ("Calling: $0 r s s\n");
  system ("$0 r s s"); 
}

Version 0.2.6

  • Makes use of const strings (used in create_toc()) for remove_toc().
  • Cleaned up generate_toc(). Removed the linking code, and the body writing and the flags. HOWEVER: It would be possible to combine generate_toc() and create_toc(), if a flag was used – you would have to ensure that only two file handles are used for generation. However, it is probably better to have a separate generate_toc() as the use of flags would make the code more messy. See create_toc2() (or create_toc3() – if we don’t want to mess with create_toc2())
  • The const string for anchor regex should contain a .* to catch any class HTML tags included in the anchor tag

Version 0.2.7

  • Major refactoring, and bug fixes
  • GetOpt implemented and used (provisionally). Only used for debug and verbose – to clear the -d and -v options. Not used for actual action arguments (yet)
  • Debug and Verbose levels implemented correctly.
  • MakeIndent
  • Old (commented out) code removed
  • Consistent naming of variables and subroutines (camel case removed)
  • No real new features added

Version 0.2.8

Version 0.2.9

Version 0.2.9.1

  • Added link section creation, generation, merging, removal, etc. See Create an external “links” section for WordPress.com posts
  • Added more checking routines – look ahead for ToC and links sections checks
  • Added process checking – can not remove a links section if ToC exists.
  • Many changes, tidying and refactoring

Version 0.2.9.2

Now, to pass the command line arguments to self_test() we can either:

  • Recreate the arguments by checking to see if $opt_xxx is set and then rebuilding the argument string from those checks – not entirely accurate, as this method could lead to a different number of -d options being passed (if there were multiple instances) [however, using $debug we could recreate them], or;
  • Save the argument string (or recreate it by parsing @ARGV), and strip out the action argument. This is more accurate.
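
The second approach might be sketched as follows, using Getopt::Long's GetOptionsFromArray on a copy of the arguments. The sub name split_arguments and the option specs 'd+' and 'v+' are assumptions for illustration; a filename beginning with "-" would be misread as an option by this simple filter.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical sketch: keep the raw arguments, parse a copy, and return the
# debug level, the options as typed, the action and the filenames.
sub split_arguments {
    my @raw_args  = @_;
    my @remaining = @raw_args;          # GetOptionsFromArray modifies this copy
    my ($debug, $verbose) = (0, 0);

    GetOptionsFromArray(\@remaining, 'd+' => \$debug, 'v+' => \$verbose)
        or die "Bad options\n";

    # The raw list still holds every option exactly as typed, repeats included
    my @options = grep { /^-/ } @raw_args;
    # What parsing left behind is the action followed by the filenames
    my ($action, @filenames) = @remaining;
    return ($debug, \@options, $action, \@filenames);
}

my ($debug, $options, $action, $filenames) = split_arguments(qw(-d -v -d c in out toc));
print "debug=$debug options=@$options action=$action files=@$filenames\n";
```

The saved options can then be joined into the command string that self_test() passes to each system() call, with the action replaced per test.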

POD

I had a number of issues getting POD to work correctly.

Issues

  • If there is a =begin at the top that doesn’t work (no previous blank line) then the rest of the perl script isn’t parsed
  • blank lines are needed before and after (see How To View POD Embedded in a Perl Script?)
  • blank lines are needed within the pod, around =item * and =over 4, unlike as shown in the examples. This page shows the spaces correctly, https://perldoc.perl.org/5.6.2/perlpod
  • The EPIC checker gave a warning about “verbatim paragraph in NAME section” when I commented out some code with “=head1 NAME” and “=cut”, see https://perldoc.perl.org/5.6.2/perlpod. The warning was removed by using “=head1 DESCRIPTION” instead.
  • multiline comments with =begin comment, =end comment, =cut or =pod and =cut?
  • Formatting doesn’t work, i.e. B<...> doesn’t bold (when used in =head1 NAME) and, using this example, C<...> just puts the contents in double quotes (or using this example then C<$a> is shown without the quotes). Using perldoc on OS X (see strange output characters from perldoc).
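
For reference, a minimal POD layout with the blank lines in the places described above. The script name ToC.pl and the section text are placeholders, not taken from the real script.

```perl
use strict;
use warnings;

our $VERSION = '0.2.5';

=pod

=head1 NAME

ToC.pl - create a table of contents for a WordPress.com post

=head1 DESCRIPTION

Every POD command here has a blank line before and after it.

=over 4

=item * create

=item * merge

=item * remove

=back

=cut

# Code resumes after =cut
print "Version $VERSION\n";
```

Running perldoc on a file laid out like this should show the NAME and DESCRIPTION sections with the three items as a bulleted list.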

Can’t view POD within EPIC, see Perl in Eclipse.

Pod Links

Other useful Perl related links

Scope links

GetOpts links

Combine GetOpts and Usage

How can you make use of the command line arguments defined for GetOpt in the usage() subroutine? The solution seems to be to use Getopt::Declare module

Or use Pod::Usage with Getopt::Long, which doesn’t seem so good.

Unfortunately Getopt::Declare isn’t installed by default and requires installing a module using cpan, cpan Getopt::Declare – which is additional work.

However, I found that this installs at /Library/Perl/Updates/5.18.2/darwin-thread-multi-2level/

$ cpan Getopt::Declare
...
Running make install
Installing /Library/Perl/5.18/Getopt/Declare.pm
Installing /usr/local/share/man/man3/Getopt::Declare.3pm
Appending installation info to /Library/Perl/Updates/5.18.2/darwin-thread-multi-2level/perllocal.pod
FANGLY/Getopt-Declare-1.14.tar.gz
/usr/bin/make install -- OK
$

This directory doesn’t appear to be in @INC , at least not on my Mac, which results in an error in Eclipse. Therefore you must make use of the lib module, to add the directory to the @INC path:

use lib "/Library/Perl/Updates/5.18.2/darwin-thread-multi-2level/";
use Getopt::Declare;

Although, once I added this use lib line, then I was able to comment it out and the error against the use Getopt::Declare line no longer appeared.

Compiling and running a test script using Getopt::Declare, showed that the usage and help is indeed rather well integrated, however, the interface is rather ugly, as it seems to drop into PerlDoc. Using either the usage example or this PerlMonks example, the script shows a blank screen, with an inverse END at the top (if the user then hits space or enter or any key, then the inverse END is at the bottom), until the user hits ‘q’. Then the usage is displayed correctly.

One thing to note is that the tab spacing shown in Eclipse doesn’t seem to line up with the tab spacings that are output in OS X Terminal.
