WP: Footnote renumbering for WordPress.com posts

Preamble

Idea for a Perl script to automatically renumber footnote references that take the form:

This is because a tuple is required⁸:

or in HTML:

<sup><a href="https://stackoverflow.com/a/66372700/4424636">8</a></sup>

These footnote references have to be entered by hand. If a new footnote reference is inserted before a series of existing footnotes, then all of the subsequent footnote references require renumbering. Renumbering by hand is a major chore.

The script was actually written in Python, using BeautifulSoup.

Code available on Gitlab:

See also

The General Idea

  • Get the HTML – download from site
  • Set counter to 1 (number of the first footnote)
  • Scan through the HTML looking for the <sup></sup> sequence, then check that it contains an anchor (basically a regex pattern)
  • Check the number of the footnote reference in the pattern against the counter
  • Modify and adjust anchor text if required
  • Once pattern is found and checked, increase counter
  • Continue scan
  • Ad infinitum
  • Print out the new HTML
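The steps above can be sketched roughly as follows. This is a minimal illustration and not the actual script: the function name renumber_footnotes() and the sample HTML are made up for the example, and I am assuming the built-in html.parser (rather than lxml) so that no <html>/<body> wrapper is added to the fragment.

```python
from bs4 import BeautifulSoup


def renumber_footnotes(html: str, start: int = 1) -> str:
    """Renumber <sup><a>N</a></sup> references in document order.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    counter = start
    for sup in soup.find_all("sup"):
        for anchor in sup.find_all("a"):
            text = anchor.get_text(strip=True)
            if not text.isdigit():
                continue  # not a footnote reference, leave it untouched
            if int(text) != counter:
                anchor.string = str(counter)  # adjust the anchor text
            counter += 1
    return str(soup)


# Made-up sample: the second reference is numbered 3 instead of 2
sample = (
    '<p>First<sup><a href="https://example.com/a">1</a></sup> and '
    'second<sup><a href="https://example.com/b">3</a></sup>.</p>'
)
print(renumber_footnotes(sample))
```

Anything inside a <sup> that is not a plain number (for example a trademark link) is skipped, so it does not disturb the count.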

This HTML is then pasted into the Text tab of the WP (classic) editor – overwriting the existing text.

Use the following URL:

https://<your blog>.wordpress.com/wp-admin/post.php?post=<post_id>&action=edit&classic-editor

to edit properly. To access this classic editor, use:

https://<your blog>.wordpress.com/wp-admin/edit.php

The new, and horrific, (block) editor is not supported.

Known issues

Footnotes section

Note that the references will be reordered. If there is a “Footnotes” section at the bottom of the file, listing all of the footnotes (which also uses the <sup>xxx</sup> format), then that would be renumbered too, unless the following marker is used to denote the end of the footnoted text and the start of the Footnotes section:

<!--- Footnotes --->

Once this marker has been detected, the counter is reset to zero. If there are more footnotes after it, for example in a list summary, and they are of the same format, then these are checked and renumbered too (if necessary).
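One possible way to handle the marker is to split the HTML at it and number each half separately. The sketch below is not how the script itself does it: the function names are hypothetical, it assumes that the marker sits at the top level between elements (not inside an open tag), and it assumes that "reset" means the Footnotes list starts again from 1.

```python
from bs4 import BeautifulSoup

FOOTNOTES_MARKER = "<!--- Footnotes --->"


def renumber(html: str) -> str:
    """Number every numeric <sup><a> reference from 1, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    counter = 1
    for anchor in soup.select("sup a"):
        if anchor.get_text(strip=True).isdigit():
            anchor.string = str(counter)
            counter += 1
    return str(soup)


def renumber_with_marker(html: str) -> str:
    """Renumber the body and the Footnotes section independently."""
    # partition() returns empty strings for the marker and the tail
    # if the marker is not present, so the whole text is one sequence.
    body, marker, footnotes = html.partition(FOOTNOTES_MARKER)
    return renumber(body) + marker + renumber(footnotes)
```

Because the marker is put back as the original string, it survives unchanged in the output.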

Combined footnotes

If there is a sequence of combined footnote references within one <sup></sup> block, then only the first reference number is re-ordered:

<sup><a href="https://www.w3schools.com/python/python_datetime.asp">5</a>,<a href="https://www.journaldev.com/23365/python-string-to-datetime-strptime#python-strptime">4</a>,<a href="https://stackoverflow.com/a/37042999/4424636">5</a></sup>

One solution is to wrap each footnote:

<sup><a href="https://www.w3schools.com/python/python_datetime.asp">5</a>,</sup><sup><a href="https://www.journaldev.com/23365/python-string-to-datetime-strptime#python-strptime">4</a>,</sup><sup><a href="https://stackoverflow.com/a/37042999/4424636">5</a></sup>

Not ideal.

Code has been fixed in check_the_order2(the_sups) and fix_the_order2(the_sups) by replacing:

    for sup in sups:
        anchor = sup.find("a")
        utils.logit(f'anchor:\n{anchor}')
        anchor_text = anchor.text 
        ...

with

    for sup in sups:
        anchors = sup.find_all("a")
        for anchor in anchors:
            utils.logit(f'anchor:\n{anchor}')
            anchor_text = anchor.text
            ...

Much better. Fixed in v1.4.

Other formatting issues

The following are all similar issues to those that were seen when writing a previous WP tool in Perl, see Create table of contents for WordPress.com posts and Create an external “links” section for WordPress.com posts and (maybe) Reading WordPress pages in Perl.

After using File Merge to compare the original text in the Text tab of the classic editor to the text output by version 1.4 of the script, I noted the following (annoying) conversions:

  • <!--more--> is replaced by <span id="more-26196"></span>
  • &nbsp; entities are lost and become literal NBSP characters (U+00A0)
  • Some characters, such as single quotes, double quotes, ellipses (three full stops, three periods) and hyphens ('"...-), all become “smart”, i.e. “smart quotes”, etc., once read by the script and then output to a text file: ’“…–. A similar issue was found in Reading WordPress pages in Perl. The ellipsis may not be too big an issue, nor the en-dash, as WP will probably auto-fix them, but it would be nice to also correct these programmatically – fixed, see Removing converted characters below.
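For the lost &nbsp; entities, one option is to put them back in the final string, after the soup has been rendered. This is only a sketch (restore_nbsp() is a made-up name). The replacement has to be done on the string, not on the soup's text nodes, because BeautifulSoup would escape the ampersand and output &amp;nbsp; instead.

```python
from bs4 import BeautifulSoup


def restore_nbsp(html: str) -> str:
    """Turn literal no-break spaces (U+00A0) back into &nbsp; entities."""
    return html.replace("\u00a0", "&nbsp;")


# The parser converts the entity into a literal U+00A0 character...
soup = BeautifulSoup("<p>a&nbsp;b</p>", "html.parser")
rendered = str(soup)

# ...so it is converted back once the soup has become a string.
restored = restore_nbsp(rendered)
print(restored)
```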

The following are less of an issue, as these should be fixed by WP auto formatting when switching from the Text to Visual tabs (and back again):

  • Spaces after the final quote in an image tag, i.e. <img src ..." /> becomes <img src ..."/> – not a problem, as reinstated by WP?
  • The indenting tabs, on list items <li> only, are lost – not a problem, as reinstated by WP?
  • Blank lines are removed
  • In <pre>...</pre> blocks, there could be a different line ending (i.e. \r or \n), as File Merge highlights the whole line through to the remaining blank space at its end (note that the actual code in the pre block is not highlighted at all).

Links used for issue solving


Using:

PyCharm 2021.1.1 (Community Edition)
Build #PC-211.7142.13, built on April 21, 2021
Runtime version: 11.0.10+9-b1341.41 x86_64
VM: Dynamic Code Evolution 64-Bit Server VM by JetBrains s.r.o.
macOS 10.13.6
GC: ParNew, ConcurrentMarkSweep
Memory: 990M
Cores: 4
Non-Bundled Plugins: com.jetbrains.nim (1.4.0-203), com.jetbrains.plugins.ini4idea (211.6693.44)

Other interesting Python links

Links for Issues

Smart quote issues

See also Removing Smart Quotes below.

Comments

Removing converted characters

Smart Quotes

From this answer to Replace all smart quotes in Beautiful Soup

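from bs4 import BeautifulSoup, Comment

def remove_smart_quotes(text):
    # Map curly single and double quotes back to their ASCII equivalents
    return text.replace("\u2018", "'") \
               .replace("\u2019", "'") \
               .replace("\u201c", '"') \
               .replace("\u201d", '"')

# html is the page source that was downloaded earlier
soup = BeautifulSoup(html, 'lxml')

for text_node in soup.find_all(string=True):
    # Skip HTML comments: replacing one with a plain string turns it into text
    if isinstance(text_node, Comment):
        continue
    # replace_with() is the current name of the deprecated replaceWith()
    text_node.replace_with(remove_smart_quotes(text_node))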

There is another method.

en-dash

The -, which is U+002D : HYPHEN-MINUS {hyphen or minus sign}, is replaced with –, which is (according to this site, What Unicode character is this ?) U+2013 : EN DASH.

The solution would seem to be to modify the remove_smart_quotes() code above.

Ellipsis

The ... is replaced with …, which is U+2026 : HORIZONTAL ELLIPSIS {three dot leader}.

The solution would seem to be to modify the remove_smart_quotes() code above.
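A possible extended version, covering the quotes, the en-dash and the ellipsis in one pass, might look like this. It is a sketch only: the name remove_converted_characters() and the translation table are my own, not taken from the script. Note that it cannot tell a converted hyphen from an en-dash that was typed deliberately, so both end up as a hyphen.

```python
# Hypothetical mapping of the "smart" characters back to plain ASCII
CONVERTED = {
    "\u2018": "'",    # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",    # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',    # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',    # RIGHT DOUBLE QUOTATION MARK
    "\u2013": "-",    # EN DASH
    "\u2026": "...",  # HORIZONTAL ELLIPSIS
}
TABLE = str.maketrans(CONVERTED)


def remove_converted_characters(text: str) -> str:
    """Replace every character in CONVERTED with its ASCII equivalent."""
    return text.translate(TABLE)


print(remove_converted_characters("\u201cwait\u2026\u201d \u2013 it\u2019s"))
```

str.translate() makes a single pass over the text, so adding further characters later only means adding another entry to the dictionary.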

The more

Unfortunately, the <!--more--> comment is replaced by <span id="more-26196"></span>

This answer to Beautiful Soup 4: How to replace a tag with text and another tag? shows how to swap a span for another tag:

for span in soup.select('span[id]'):
    # insert sup tag after the span
    sup = soup.new_tag('sup')
    sup.string = span['id']
    span.insert_after(sup)

    # replace the span tag with its contents
    span.unwrap()

However, the id will not be the same for different WP pages, so search on the partial id, see find tags that has partial id value using BeautifulSoup:

soup.select('span[id*="more-"]')

You would assume that this might work:

comment = soup.new_tag('comment')
# or
comment = soup.new_tag(Comment)
# and
comment.string = 'more'
span.insert_after(comment)

but no. However, from the docs (and this answer to Python Beautiful soup insert comment in html), this works:

from bs4 import Comment
comment = Comment("more")

Note that the following gives no error, but the comment does not change:

# This doesn't work
comment.string = 'more'

The final solution is

    from bs4 import Comment

    for span in soup.select('span[id*="more-"]'):
        # insert a <!--more--> comment after the span
        comment = Comment("more")
        span.insert_after(comment)

        # remove the (empty) span tag itself
        span.unwrap()
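As an alternative to insert_after() followed by unwrap(), the span could be swapped for the comment in a single call with replace_with(). This is just a sketch of that variation: the function name is made up, and I am assuming that the id always starts with "more-", hence the ^= selector rather than *=.

```python
from bs4 import BeautifulSoup, Comment


def restore_more_marker(soup: BeautifulSoup) -> BeautifulSoup:
    """Swap each <span id="more-NNN"></span> for a <!--more--> comment."""
    for span in soup.select('span[id^="more-"]'):
        span.replace_with(Comment("more"))
    return soup


# Made-up sample using the span seen in the downloaded HTML
soup = BeautifulSoup(
    '<p>a</p><span id="more-26196"></span><p>b</p>', "html.parser"
)
print(restore_more_marker(soup))
```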

The final <div> wrapper

The content is wrapped by <div class="entry-content"></div>, and this div needs to be shed.

Use unwrap(), again from this answer to Beautiful Soup 4: How to replace a tag with text and another tag?

However, this time it doesn’t work as expected: you either get just the tag, or nothing at all, using this variety of six methods:

    # AttributeError: 'NoneType' object has no attribute 'unwrap'
    # last_div = soup.find("div", {"class": "entry-content"})
    # last_div.unwrap()

    # AttributeError: 'NoneType' object has no attribute 'unwrap'
    # last_div = soup.find("div", {"class": "entry-content"})
    # return last_div.unwrap()

    # return soup.unwrap()  # Returns just the tag!
    # These two lines also return just the tag!
    # soup.unwrap()
    # return soup

    # AttributeError: 'NoneType' object has no attribute 'unwrap'
    # div_tag = soup.div
    # div_tag.unwrap()

    # AttributeError: 'NoneType' object has no attribute 'unwrap'
    # div_tag = soup.find("div", {"class": "entry-content"})
    # div_tag.unwrap()

This question, Python Beautiful Soup unwrap() not working as expected – want to extract content of a tag, sums up the issue perfectly.

unwrap() requires a top level “wrapping” tag that wraps the whole soup. However, the WP Text edit tab needs unwrapped text, with no top level wrap, so this answer to the above question provides the fix:

str1 = ''
for item in soup.find('div', id='content').children:
    str1 = str1 + str(item)

print(str1)

although we need

for item in soup.find('div', {"class": "entry-content"}).children:

Because of this, this final unwrapping, or removal of the enclosing <div> needs to be done last, after all soup manipulation is complete.
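A more compact way to get the same string might be Tag.decode_contents(), which renders everything inside a tag without the tag itself. The sketch below uses a made-up function name and adds a guard for the case where find() returns None, which is what the AttributeError messages above indicate was happening.

```python
from bs4 import BeautifulSoup


def strip_entry_content(soup: BeautifulSoup) -> str:
    """Return the inner HTML of <div class="entry-content"> as a string."""
    wrapper = soup.find("div", class_="entry-content")
    if wrapper is None:
        # No wrapper found: fall back to the whole document
        return str(soup)
    return wrapper.decode_contents()


# Made-up sample for illustration
html = '<div class="entry-content"><p>Hello</p><!--more--><p>World</p></div>'
print(strip_entry_content(BeautifulSoup(html, "html.parser")))
```

As with the loop over .children, the result is a string and not a soup, so it still has to be the last step.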

Also, this means that the final object is now a string and not a soup – up to (and including) v1.5 the save and the copy to clipboard functions took the soup as an argument (as did the final print out of the content). This means that these routines will need to be changed as well, in v1.6.

Unusual side effects

Link creation from a URL

I noted that version 1.5 (and possibly 1.4 and maybe prior versions also) converted a plain URL (i.e. https://realpython.com/python-web-scraping-practical-introduction/) into a clickable link to the same URL.
