Plotting SE3DP stats

Preamble

A graph plotter for statistics, taken from the Area51 site over time, relating to the Stack Exchange 3D Printing (SE3DP) web site.

A web scraper for the Area51 site – to obtain the statistics and save as CSV format – is also included.

Available on GitLab: testkins/se3dp_plotterscraper

Synopsis

Graphing script that uses a CSV list, sourced either locally (in the script) or from the web (from this answer to the question What does it take to get out of Beta stage? on the SE3DP meta site), to produce a variety of graphs. The data scraped from the Area51 site, using the web scraper script, can also be used as a data source. The graphs are pre-set, and each one is enabled by setting the appropriate flag.

The data can be scraped from the page in one of three ways:

  • Regex
  • String Search
  • BeautifulSoup

The plotted graphs can be saved as PDF or PNG.

[Should we be able to save the pandas DataFrame?]

Historical data

The script currently uses manually saved data, held either locally or remotely. It would be good to be able to query Area51 for historical data, but I don’t know if this is possible; see Graphs showing Area51 stats over time. This answer to Why are site stats graph for beta sites not available to public? shows that an SQL query can be used: Query for 300 weeks. However, this query isn’t exactly the same as the Area51 stats.

What is required is a modified SQL query that pulls the past Area51 beta stats, each week, with the questions/day, answers/question, etc. averaged over that week, rather than a snapshot of one day of the week (which may not be representative of the rest of the week).


Graph types

Style

  • Line
  • Stack – solid under line (doesn’t work well, nor make sense, for combined data sets – except for User reputation)

Size

The size is set in inches, and specified in the graphSize parameter:

graphSize = (8,5)  # (width, height) in inches

Datasets

  • All on one graph – makes no sense as the units and scales are different
  • Individual
  • Combined:
      • Users rep – easy as they have the same units
      • Questions per day and Answers per question – similar units
      • Questions per day and Answers per question with either Answer rate or Visits per day
      • Questions per day and Answers per question with both Answer rate and Visits per day

Note that using these two plots

  • Users rep – stacked or line
  • Questions per day and Answers per question with both Answer rate and Visits per day

means that only two graphs are required, and both may be plotted within one figure.

Combined graphs

Plotting more than two graphs on the same axes requires more than two Y axes, with different units and scales.

Where two graphs, such as Questions per day and Answers per question, have similar units and scales, they can be plotted using just one Y axis, as both have a similar range and the same units (in this case, a plain number).

If the ranges are wildly different, such as with Visits per day, which is in the order of thousands, then the other plot is squashed and is not particularly useful for visualising the data variation.

However, using a second Y axis [1] on the right hand side of the plot means we can add either a plot with different units (such as Answer rate) or with a very different scale (such as Visits per day) to the combined Questions per day and Answers per question graph – thereby creating a useful visualisation.

However, adding both Answer rate and Visits per day to the combined Questions per day and Answers per question graph would require a third Y-axis… which is known as a parasite axis [2].

However, unfortunately, when I plot the example, while the parasite axis is displayed, the second “twin” temperature axis (on the immediate right hand side of the plot) is not:

This answer to multiple axis in matplotlib with different scales [duplicate] would seem to offer a solution. This solution offers at least two advantages: it uses no extra libraries and it shows how to save a PDF of the plot.

It is easily adapted to add an additional Y-axis (making 4 in total). However, using legend with the loc='best' modifier, the legend is placed over the plots of two of the line graphs, when using:

host.legend(handles=lns, loc='best')

It is possible to actually “fine tune” the location of the legend using:

ax2.legend(loc=(0.65, 0.8))

where the numbers are a proportion of the width and height respectively.
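As a rough illustration of the approach described above, the following sketch builds a three Y-axis plot using only matplotlib (twinx() plus an outward-shifted spine) and places one combined legend at an explicit position. The data values are placeholders for illustration only, and the variable names (host, par1, par2, lns) simply follow the snippets in this section; this is not the script's actual plotting function.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Placeholder values, purely for illustration (not real Area51 statistics)
x = [1, 2, 3, 4]
questions_per_day = [2.1, 1.9, 2.7, 2.4]
answer_rate = [96, 93, 96, 95]
visits_per_day = [750, 2300, 2700, 3700]

fig, host = plt.subplots(figsize=(8, 5))
par1 = host.twinx()  # second Y-axis, on the right-hand side
par2 = host.twinx()  # third (parasite) Y-axis, also on the right

# Push the third axis outwards so that it does not sit on top of the second
par2.spines['right'].set_position(('outward', 60))

p1, = host.plot(x, questions_per_day, 'b-', label='Questions per day')
p2, = par1.plot(x, answer_rate, 'r-', label='Answer rate')
p3, = par2.plot(x, visits_per_day, 'g-', label='Visits per day')

host.set_ylabel('Questions per day')
par1.set_ylabel('Answer rate')
par2.set_ylabel('Visits per day')

# One legend for all three lines, at an explicit (x, y) position
# given as a proportion of the axes width and height
lns = [p1, p2, p3]
legend = host.legend(handles=lns, loc=(0.65, 0.8))

fig.tight_layout()
```

Collecting the line handles into one list is what allows a single legend to describe lines that live on three different axes.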

For either the three or four plots, it may be possible to not display one axis if that makes sense (i.e. the scales and ranges are the same, for example questions per day and answers per question). However, there may then be a missing axis label (is it possible to hide a Y-axis while still keeping the label, i.e. use two labels on one Y-axis?). Also, as noted in the conclusions, using the same axis for questions per day and answers per question distorts the graph and may not be so useful (answers per question is very flattened).

For three plots only, the three datasets are chosen by setting the index in the list flagThreePlotDataSet[]. Note that the indexing starts at 1, not 0:

"""
Parasitic 3 Y-axis Graphing Permutations:

1 - Questions per day
2 - Answer Rate
3 - Users200
4 - Users2k
5 - Users3k
6 - Answers per question
7 - Visits per day
"""
flagThreePlotDataSet = [1, 2, 5]

Other notes

  • No need for legend on single data set graphs, as the title shows the name of the data set, so omitted.
  • Graph size – different graphs need differing sizes, as extra parasite Y-axes require greater width. The standard size is (8, 5); a 3 Y-axis plot also fits in (8, 5), but a 4 Y-axis plot needs more width.

x-ticks (label) rotation

A number of ways of rotating x-axis labels are given in this answer to Rotate axis text in python matplotlib.

Also this answer to How to rotate x-axis tick labels in Pandas barplot suggests:

ax.set_xticklabels(df['Names'], rotation=90, ha='right')

But it has issues

However, these give endless annoying issues:

This

plt.rc('xtick', ha='right')

is not OK, but this is (although this doesn’t work on ax):

# plt.xticks(rotation=cfgDateRotation, ha='right')

But this

ax.tick_params(axis='x', labelrotation=cfgDateRotation, ha='right')

causes

ValueError: keyword ha is not recognized; valid keywords are ['size', 'width', 'color', 'tickdir', 'pad', 'labelsize', 'labelcolor', 'zorder', 'gridOn', 'tick1On', 'tick2On', 'label1On', 'label2On', 'length', 'direction', 'left', 'bottom', 'right', 'top', 'labelleft', 'labelbottom', 'labelright', 'labeltop', 'labelrotation', 'grid_agg_filter', 'grid_alpha', 'grid_animated', 'grid_antialiased', 'grid_clip_box', 'grid_clip_on', 'grid_clip_path', 'grid_color', 'grid_contains', 'grid_dash_capstyle', 'grid_dash_joinstyle', 'grid_dashes', 'grid_data', 'grid_drawstyle', 'grid_figure', 'grid_fillstyle', 'grid_gid', 'grid_in_layout', 'grid_label', 'grid_linestyle', 'grid_linewidth', 'grid_marker', 'grid_markeredgecolor', 'grid_markeredgewidth', 'grid_markerfacecolor', 'grid_markerfacecoloralt', 'grid_markersize', 'grid_markevery', 'grid_path_effects', 'grid_picker', 'grid_pickradius', 'grid_rasterized', 'grid_sketch_params', 'grid_snap', 'grid_solid_capstyle', 'grid_solid_joinstyle', 'grid_transform', 'grid_url', 'grid_visible', 'grid_xdata', 'grid_ydata', 'grid_zorder', 'grid_aa', 'grid_c', 'grid_ds', 'grid_ls', 'grid_lw', 'grid_mec', 'grid_mew', 'grid_mfc', 'grid_mfcalt', 'grid_ms']

Final (temporary) label rotation solution

In the end I had to use an alternate method using a combination of three techniques: a (fixed) formatter [6], a (fixed) locator (plt.setp()) and scaled translation [6]:

    # Using formatter
    import matplotlib.dates as mdates
    myFmt = mdates.DateFormatter('%m-%Y')
    ax.xaxis.set_major_formatter(myFmt)
    plt.setp(ax.get_xticklabels(), rotation=cfgDateRotation, ha="right")
    # plt.show()

    # Scaled translation
    # create offset transform (dx / fig.dpi inches, i.e. a shift of 5 pixels in x)
    from matplotlib.transforms import ScaledTranslation
    fig = plt.gcf()
    dx, dy = 5, 0
    offset = ScaledTranslation(dx / fig.dpi, dy / fig.dpi, scale_trans=fig.dpi_scale_trans)

    # apply offset transform to all xticklabels
    for label in ax.xaxis.get_majorticklabels():
        label.set_transform(label.get_transform() + offset)

Illustration of plots with rotated x-axis labels

No rotation

Rotation without alignment

Rotation with alignment (note the slight misalignment)

Rotation with alignment and scaled translation

Relocation of the parasitic Y axes

The 3 and 4 dataset plots can have their parasitic Y-axes placed on either the right-hand or left-hand side of the graph.

Using this code (for 3 Y axes)

    locationLeft = 60
    locationRight = 60
    if cfgParasiticAxesLocation[0] == 'right':
        par2.spines['right'].set_position(('outward', locationRight))
    # else:
    elif cfgParasiticAxesLocation[0] == 'left':
        par2.spines['left'].set_position(('outward', locationLeft))
        par2.spines['left'].set_visible(True)
        par2.yaxis.set_label_position('left')
        par2.yaxis.set_ticks_position('left')
    else:
        print(f'Warning!!!!!: Incorrect parameter specified.')
        print(f'Warning!!!!!: Check parameters of cfgParasiticAxesLocation[0]: {cfgParasiticAxesLocation[0]}')
        print(f'Warning!!!!!: Value should be \'left\' or \'right\'')
        print(f'Warning!!!!!: Defaulting to \'right\'')
        par2.spines['right'].set_position(('outward', locationRight))
        # print(f'Warning!!!!!: Defaulting to \'left\'')
        # par2.spines['left'].set_position(('outward', locationLeft))
        # par2.spines['left'].set_visible(True)
        # par2.yaxis.set_label_position('left')
        # par2.yaxis.set_ticks_position('left')

or, this code (for 4 Y axes)

    if cfgParasiticAxesLocation[0] == 'right':
        par2.spines['right'].set_position(('outward', locationRight))
        locationRight += locationShift  # The next 'right' is shifted across
    # else:
    elif cfgParasiticAxesLocation[0] == 'left':
        par2.spines['left'].set_position(('outward', locationLeft))
        par2.spines['left'].set_visible(True)
        par2.yaxis.set_label_position('left')
        par2.yaxis.set_ticks_position('left')
        locationLeft += locationShift  # The next 'left' is shifted across
    else:
        print(f'Warning!!!!!: Incorrect parameter specified.')
        print(f'Warning!!!!!: Check parameters of cfgParasiticAxesLocation[0]: {cfgParasiticAxesLocation[0]}')
        print(f'Warning!!!!!: Value should be \'left\' or \'right\'')
        print(f'Warning!!!!!: Defaulting to \'left\'')
        par2.spines['left'].set_position(('outward', locationLeft))
        par2.spines['left'].set_visible(True)
        par2.yaxis.set_label_position('left')
        par2.yaxis.set_ticks_position('left')
        locationLeft += locationShift  # The next 'left' is shifted across

    # Position the second parasitic axis (par3) on the left or right
    if cfgParasiticAxesLocation[1] == 'right':
        par3.spines['right'].set_position(('outward', locationRight))
    # else:
    elif cfgParasiticAxesLocation[1] == 'left':
        par3.spines['left'].set_position(('outward', locationLeft))
        par3.spines['left'].set_visible(True)
        par3.yaxis.set_label_position('left')
        par3.yaxis.set_ticks_position('left')
    else:
        print(f'Warning!!!!!: Incorrect parameter specified.')
        print(f'Warning!!!!!: Check parameters of cfgParasiticAxesLocation[1]: {cfgParasiticAxesLocation[1]}')
        print(f'Warning!!!!!: Value should be \'left\' or \'right\'')
        print(f'Warning!!!!!: Defaulting to \'right\'')
        par3.spines['right'].set_position(('outward', locationRight))

It is possible to relocate the parasitic Y axes by setting the configuration as desired:

# Location of the 3rd and 4th parasitic axes, use 'left' or 'right'
cfgParasiticAxesLocation = ['left', 'right']

Examples of relocated parasitic Y axes

3 Y axes

Left

Right

4 Y axes

Shown here in the dual axes (dual plot) window.

Left left

Right right

Left right

 

Setting window title

See my answer to Change figure window title in pylab

Note related: https://www.programcreek.com/python/?CodeExample=set+window+title


Filename dictionary

A dictionary is used for the filenames that the graphs are saved under. Dictionaries guide: https://www.programiz.com/python-programming/dictionary

Figure sizes

The graphs vary in minimum size, especially the 4 axis plot. See How do you change the size of figures drawn with Matplotlib?

Plot size is set using figsize which takes a tuple of (x, y) values in inches:

  • (8, 5) is adequate for 3 Y-axes,
  • (9, 5) is required for 4 Y-axes. For the sake of simplicity, this was used as the default size for all plots

Dual axes figures use twice the width, so that the figsize is set to (2*x, y), or (18, 5). Nevertheless, it should be noted that for the dual axes figures (i.e. two plot windows), when the 4 Y-axes plot is one of the figures, it encroaches on to the other plot’s space – even though (9,5) is sufficient for a 4 Y-axes plot on its own. It is not clear:

  1. Why this encroachment occurs, nor:
  2. How to fix this issue and force (i.e. restrict) each of the two plots into 50% of the total figure space

rcParams

pyplot: Can I set a global marker size parameter?

Using rcParams: This feature has not been fully implemented as of version 6.

OpenCV UI app links

For the image viewer from https://realpython.com/pysimplegui-python/#creating-simple-applications

imports required

  • os

Global warning

After experimenting with config files, I was plagued with random weak warnings appearing in PyCharm, for a while:

Global variable 'flagParasiteThree' is undefined at the module level

See: False Positive https://youtrack.jetbrains.com/issue/PY-35013

However, the issue seems to be related to a .py module that contained the globals, together with this import statement (the module is no longer used, as a .cfg file read with configparser has replaced it):

from Resources import graph_config

The answer would seem to be here. See also Global variable is undefined at the module level.
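For reference, reading flags with configparser might look something like the sketch below. The section name ([graphs]) and the file name in the comment are assumptions made for the illustration, and only flagParasiteThree, flagQuestionsPerDay and flagAnswerRate are taken from these notes; the real .cfg layout may well differ.

```python
import configparser

# Hypothetical config text; in practice this would live in a .cfg file
# and be loaded with config.read('graph_config.cfg') (file name assumed)
cfg_text = """
[graphs]
flagParasiteThree = yes
flagQuestionsPerDay = true
flagAnswerRate = false
"""

config = configparser.ConfigParser()
config.read_string(cfg_text)

# getboolean() understands yes/no, true/false, on/off and 1/0
flagParasiteThree = config.getboolean('graphs', 'flagParasiteThree')
flagQuestionsPerDay = config.getboolean('graphs', 'flagQuestionsPerDay')
# fallback= supplies a default if the option is missing from the file
flagAnswerRate = config.getboolean('graphs', 'flagAnswerRate', fallback=False)

print(flagParasiteThree, flagQuestionsPerDay, flagAnswerRate)
```

Because the flags are read into ordinary names at the point of use, no module of globals needs to be imported.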

Structure

Top Level

  1. Get data
  2. Reorganise data
  3. Plot data

Getting the data

The data can be scraped from the SE.3DP Meta page in one of three ways:

  • Regex – initially didn’t work across multiple lines (see Regex, under Web retrieval)
  • StringSearch
  • BeautifulSoup – HTML parser

The method used is selected by setting a flag whose value is 1-3:

"""
flagScrapeMethod:
        1: Regex
        2: String Search
        3: Beautiful Soup
"""
flagScrapeMethod = 2

The actual CSV data is held between a pair of <pre><code>…</code></pre> tags. Any blank lines preceding or following the CSV data within those tags are ignored.

Graphs produced – flags

""" 
Graphing Permutations:
Questions per day
Answer Rate
AllUsers
Users200
Users2k
Users3k
Answers per question
Questions & Answers
Visits per day
All on one sheet
"""
# flagWeb = False
flagWeb = True
DEBUG=True
flagQuestionsPerDay = True
flagAnswerRate = False
flagAllGraphs = False
flagAllUsers = True
flagUsers2c = False
flagUsers2k = False
flagUsers3K = False
flagAnswersPerQuestion = False
flagQuestionsAndAnswers = True
flagVisitsPerDay = True
flagOneSheet = False

There are many more – NEED TO UPDATE, or just see source code.

Data Index

The data is held in an array. The data indices are as follows:

    Data index:

    0 - Questions per day
    1 - Answer Rate
    2 - Users200
    3 - Users2k
    4 - Users3k
    5 - Answers per question
    6 - Visits per day

Logging

See Python: Debug Logging

A simplistic and lightweight implementation was chosen rather than using the full logging Python package.

def logit(s):
    if DEBUG:
        print(s)

This method allows for a straightforward substitution of logit for print. However, as logit() takes a single argument, it doesn’t allow additional data to be passed like so:

logit('length of data:',len(data))

f-strings must be used, like so:

logit(f'Checking date: {topRowDate}')
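If print-style calls with several arguments were wanted, one possible variation (a sketch only, not what the script currently does) is to have logit() accept and forward any number of arguments:

```python
DEBUG = True


def logit(*args, **kwargs):
    """Print the arguments only when DEBUG is set - same call style as print()"""
    if DEBUG:
        print(*args, **kwargs)


data = [1, 2, 3]
logit('length of data:', len(data))    # several arguments, as with print()
logit(f'length of data: {len(data)}')  # f-strings still work
```

Both calls print the same line, so existing f-string calls would not need to change.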

Tracing function names

From this answer to Determine function name from within that function (without using traceback)

functionNameAsString = sys._getframe().f_code.co_name

Then pass functionNameAsString to logit().

Also, from this comment

import sys 
def thisFunctionName(): 
    """Returns a string with the name of the function it's called from""" 
    return sys._getframe(1).f_code.co_name

This actually works. I would have thought that it returned thisFunctionName, but no (sys._getframe(1) refers to the caller’s frame); this snippet

def RetrieveWebData(url):
    """Take URL. Return CSV data"""
    logit(f'Entering RetrieveWebData(): {thisFunctionName()}')

returns RetrieveWebData.

Issues encountered

Data reformatting and extraction

The data is in the form of seven rows of comma separated values.

However, the data is not presented in a particularly useful format, although the format is useful for data entry and cross validation of the data entry:

heading,data,date,data,date,...,data,date

However, this format does not lend itself to immediate plotting, not without some reorganisation. In particular, the inter-mixing of the dates with the data points is problematic. To remedy this, the following programmatic steps are required:

  • Date checking for each line of data, to ensure consistency and that no typos (i.e. data entry errors) exist
  • Date extraction and create a separate list of dates
  • Removing the dates from the data lists
  • Heading extraction and create a separate list of headings

Once these steps have been taken then the raw data should exist as 7 rows of purely comma separated data values. Note, that a CSV object is not required.
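The steps above can be sketched with list slicing. The function name separateDatesAndData() is hypothetical (it is not one of the script's functions), and the sketch assumes each row has already been split into a list of the form [heading, value, date, value, date, ...]:

```python
def separateDatesAndData(rows):
    """Split rows of [heading, value, date, value, date, ...] into dates, data and headings"""
    headings = [row[0] for row in rows]
    # Dates sit at the even positions after the heading (2, 4, 6, ...)
    dates = rows[0][2::2]
    for row in rows:
        # Date check: every row must carry exactly the same dates
        if row[2::2] != dates:
            raise ValueError(f'Date mismatch in row: {row[0]}')
    # Values sit at the odd positions (1, 3, 5, ...)
    data = [row[1::2] for row in rows]
    return dates, data, headings


rows = [
    ['*Questions per day*', 2.1, 20170317, 1.9, 20180525],
    ['*Answer rate*', 96, 20170317, 93, 20180525],
]
dates, data, headings = separateDatesAndData(rows)
print(dates)     # [20170317, 20180525]
print(data)      # [[2.1, 1.9], [96, 93]]
print(headings)  # ['*Questions per day*', '*Answer rate*']
```

Comparing each row's dates against the first row is a cheap way of catching a data entry typo in any one row.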

Dates

To plot time-sensitive data, the dates need to be in a Python date format, as datetime objects. The dates in the CSV are specified in the ISO 8601 basic format (without the time), YYYYMMDD, which helps conversion [5, 4, 5] using strptime(), which is the opposite of strftime().
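A minimal sketch of the conversion follows. It assumes the dates may already have been turned into integers by the numeric conversion step, hence the str() call; the variable names are illustrative.

```python
from datetime import datetime

raw_dates = [20170317, 20180525, 20210411]

# %Y%m%d matches the YYYYMMDD layout; .date() drops the (empty) time part
dates = [datetime.strptime(str(d), '%Y%m%d').date() for d in raw_dates]

print(dates[0])                    # 2017-03-17
print(dates[0].strftime('%m-%Y'))  # 03-2017
```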

CSV

Not required [6]

Web retrieval

Initially, regex didn’t work over multiple lines, so String Search method was used instead. Then web scraping using BeautifulSoup was added.

Regex

Initially .search was used (https://realpython.com/python-web-scraping-practical-introduction/).

pattern = "<pre><code>.*?</code></pre>" 
match_results = re.search(pattern, html, re.IGNORECASE)

However, it would not work over multiple lines, only matches on one line would work:

pattern = "<pre><code>.*?20210411" # This is just a test - works as it is on same line

The trick is to use re.DOTALL, as stated in this answer; re.MULTILINE did not work:

pattern = "<pre><code>.*?</code></pre>"
match_results = re.search(pattern, html, re.DOTALL)
logit(f'match_results:\n{match_results}')
logit(f'match_results.group():\n{match_results.group()}')

gives

<pre><code>*Questions per day*,2.1,20170317,1.9,20180525,1.6,20180705,2.1,20180707,2.7,20180815,2.1,20180903,1.7,20181015,2,20181106,2.4,20190327,3.0,20190905,2.5,20191119,3.9,20210121,2.8,20210411
*Answer rate*,96,20170317,93,20180525,95,20180705,96,20180707,96,20180815,97,20180903,98,20181015,98,20181106,96,20190327,95,20190905,94,20191119,88,20210121,88,20210411
*200+ reputation*,56,20170317,103,20180525,113,20180705,139,20180707,144,20180815,151,20180903,161,20181015,164,20181106,179,20190327,194,20190905,282,20191119,351,20210121,358,20210411
*2,000+ reputation*,4,20170317,8,20180525,9,20180705,10,20180707,11,20180815,12,20180903,14,20181015,14,20181106,17,20190327,19,20190905,22,20191119,27,20210121,27,20210411
*3,000+ reputation*,3,20170317,4,20180525,6,20180705,7,20180707,7,20180815,7,20180903,7,20181015,8,20181106,9,20190327,11,20190905,12,20191119,14,20210121,14,20210411
*Answers per question*,2.0,20170317,1.9,20180525,1.9,20180705,1.9,20180707,1.9,20180815,1.9,20180903,1.9,20181015,1.9,20181106,1.9,20190327,1.9,20190905,1.9,20191119,1.9,20210121,1.9,20210411
*Visits per day*,753,20170317,4,20180525,2324,20180705,2648,20180707,2675,20180815,2774,20180903,2844,20181015,3041,20181106,3707,20190327,2934,20190905,3290,20191119,8756,20210121,7146,20210411

</code></pre>

but, it is very slow – much slower than the String Search, or BeautifulSoup methods.

Note the blank line at the end, which is in the webpage. This needs to be stripped using .strip(), which is already done by convStr2List(), as the string search method also returned the final blank line. The HTML tags also need removing:

data = data[11:-13]  # Crop to lose the HTML tags
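The slice above relies on the tag lengths (11 and 13 characters). A possible alternative, shown here only as a sketch with a shortened stand-in for the page HTML, is to put a capturing group in the pattern so that the tags never end up in the result:

```python
import re

# Shortened stand-in for the page source (two rows taken from the data shown above)
html = """<p>Some text with <code>inline code</code></p>
<pre><code>*Questions per day*,2.1,20170317,1.9,20180525
*Answer rate*,96,20170317,93,20180525

</code></pre>"""

# The parentheses capture only the text between the tags;
# re.DOTALL lets '.' match newlines, so the match can span several lines
pattern = re.compile(r'<pre><code>(.*?)</code></pre>', re.DOTALL)
match_results = pattern.search(html)
if match_results:
    data = match_results.group(1).strip()  # group(1) excludes the tags
    print(data)
```

With this form, a change in the tags would only require the pattern to be edited, not the slice offsets as well.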

Note that this answer recommends not using .DOTALL, but that recommendation is for the OP’s particular case.

String Search

CSV block is retrieved as a multiline string, with \n separators. Split on \n.

Each row is then split [7] on a comma (,).
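In outline, the string search approach amounts to something like the following sketch. The html string is a shortened stand-in for the page source, and the variable names are illustrative rather than those used in the script.

```python
# Shortened stand-in for the page source
html = """<p>Intro with <code>inline code</code></p>
<pre><code>*Questions per day*,2.1,20170317,1.9,20180525
*Answer rate*,96,20170317,93,20180525

</code></pre>"""

startTag = '<pre><code>'
endTag = '</code></pre>'

# Locate the CSV block by searching for the surrounding tags
start = html.find(startTag)
end = html.find(endTag, start)
if start == -1 or end == -1:
    raise ValueError('CSV block not found')

# Keep only the text between the tags, minus leading/trailing blank lines
block = html[start + len(startTag):end].strip()

# One list per line, one element per comma separated value
rows = [line.split(',') for line in block.split('\n')]
print(rows[0])  # ['*Questions per day*', '2.1', '20170317', '1.9', '20180525']
```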

However, all of the previously numerical values are now strings, so they must be converted to numbers (integer or float, as appropriate) [8, 9], using a small function:

def num(s):
    """Return a number (float or int)"""
    logit(f'Entering {thisFunctionName()}()')

    # https://stackoverflow.com/questions/379906/how-do-i-parse-a-string-to-a-float-or-int#comment100283780_379910
    # return int(a) if float(a) == int(float(a)) else float(a)
    # or
    # https://stackoverflow.com/a/379966/4424636
    try:
        return int(s)
    except ValueError:
        return float(s)

Each element in the list is passed to num(), except the first element, which at this stage of processing (the Web Extraction stage) is still the heading, as the headings haven’t yet been stripped out and extracted:

def stringStripper(data):
    """Remove the single quotes around the numbers (irrespective of whether they are ints or floats)"""
    logit(f'Entering {thisFunctionName()}()')

    logit(f'Before stripper: {data}')
    for row in data:
        for item in range(1, len(row), 1):
            logit(f'rowitem{item}: {row[item]}')
            row[item] = num(row[item])
        logit(f'row: {row}')
    logit(f'After stripper: {data}')
    return data

Using BeautifulSoup to scrape

To simplify things, using an id tag in the <pre><code> section would be a good idea [17]. However, in the spirit of making things as difficult as possible and not wishing to mess with the natural order of things, it would seem prudent to just deal with the page as it stands, without adding any niceties (in a similar manner to the annoying commas in the 2k and 3k user reputation headings).

On the meta page there are multiple instances of <code></code> tags, but only one <pre></pre> tag (which encapsulates the <code></code> that contains the actual CSV data that we want).

Single character labels in the legend

This only happens for headings when there is only one set (i.e. one slice) of graph data passed. This is because a sequence of labels (a list or tuple) is required [10]; a bare string is indexed character by character:

# Need brackets around a single heading label
# See https://stackoverflow.com/a/66372700/4424636
plotAllGraphsStack(dates, data[6], [headings[6]])

Split heading

As two of the headings contain a comma (,), namely *2,000 reputation* and *3,000 reputation*, there was an issue when splitting the corresponding row of data, whereby the heading was split into two parts. To remedy this, there are three options:

  1. Split using a better method – not researched
  2. Change the heading itself to remove the comma, such that the headings become *2000 reputation* and *3000 reputation*  – this is the lazy option
  3. Write a post processing kludge that rejoins the two headings [11, 12]
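For what it is worth, option one might be approached as in the sketch below. It is untested against the live page and assumes that every heading is wrapped in asterisks, as in the data shown earlier; splitRow() is a hypothetical name, not a function in the script.

```python
import re


def splitRow(line):
    """Split one CSV line, keeping the *heading* intact even if it contains a comma"""
    # Peel off the asterisk-wrapped heading first, then split only the remainder
    match = re.match(r'\*(.*?)\*,(.*)', line)
    if match is None:
        raise ValueError(f'Unexpected row format: {line}')
    heading = f'*{match.group(1)}*'
    return [heading] + match.group(2).split(',')


print(splitRow('*2,000+ reputation*,4,20170317,8,20180525'))
# ['*2,000+ reputation*', '4', '20170317', '8', '20180525']
```

Because the heading is removed before the split, no rejoining step is needed afterwards.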

Option three:

def rejoinKUsers(data):
    """Manually join the split 2K and 3K user headings - that were split due to a comma in the heading"""
    logit(f'Entering {thisFunctionName()}()')

    for row in data:
        logit(f'row[0]: {row[0]}')
        if row[0] == '*2' or row[0] == '*3':
            logit('joining!!!!')
            row[0:2] = [','.join(row[0:2])]
    return data

Two axes in one fig

Some points:

  • Possible to use existing functions and pass ax as a context within which the plots are made.
  • The figsize modifier determines the size of the fig, rather than the size of the axes, so it would need to be twice the size specified in the graphSize parameter. Or rather the width of the tuple needs to be doubled, like so:
graphSizeTwiceAsWide = (graphSize[0]*2, graphSize[1])
  • fig.tight_layout() is required, else the two axes overlap
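Putting those points together, a two-axes figure could be sketched as follows. plotOnAx() is a hypothetical stand-in for the script's plotting functions that accept an ax argument, and the data values are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

graphSize = (8, 5)  # (width, height) in inches, for a single plot
graphSizeTwiceAsWide = (graphSize[0] * 2, graphSize[1])


def plotOnAx(ax, x, y, title):
    """Hypothetical helper: draws one line graph on the axes it is given"""
    ax.plot(x, y)
    ax.set_title(title)
    ax.grid(True)


# One figure, two side-by-side axes; figsize applies to the whole figure
fig, (axLeft, axRight) = plt.subplots(1, 2, figsize=graphSizeTwiceAsWide)
plotOnAx(axLeft, [1, 2, 3], [2.1, 1.9, 1.6], 'Questions per day')
plotOnAx(axRight, [1, 2, 3], [96, 93, 95], 'Answer rate')

fig.tight_layout()  # stops the two sets of axes overlapping
```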

Implementations

Generic index called plots

In order to further simplify the calling of the individual graph plotting functions – thereby reducing the amount of code and number of functions required, whilst admittedly at the same time complicating, and making the code less readable – there is an obvious implementation change that can be made.

As the individual graph drawing functions are nearly identical, differing in only the index number of the data row, it is possible to loop through the graph plot using an index and passing that index to an index specific graph drawing function:

def plotIndexGraph(dates, data, headings, index=5):
    """Individual graph - Plots the individual index graph - line graph - one set of axes

    Defaults to answers per question"""
    logit(f'Entering plotIndexGraph()')
    logit(f'headings[{index}]: {headings[index]}')
    plotAllGraphs(dates, [data[index]], [headings[index]],
                  headings[index])

In order to facilitate this, the flags need to be put into a list that can also reference the individual flags using the same index:

flagsIndividual = [flagQuestionsPerDay, flagAnswerRate, flagUsers2c, flagUsers2k, flagUsers3K, flagAnswersPerQuestion, flagVisitsPerDay]

The same can be done for the combinational graph flags.

To get the indices of the flags set to True [13]:

[i for i, x in enumerate(flagsIndividual) if x]

Note that a dictionary could be used, where the dataset name is the key to an index, which would improve readability (not implemented). However, it is not really necessary when using this flag implementation.
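A sketch of how the dictionary idea might look is given below. It is not implemented in the script; the names dataIndex and flagsByName are hypothetical, the indices follow the Data Index section, and the True/False values mirror the example flags listed earlier.

```python
# Dataset name -> row index in the data array (see the Data Index section)
dataIndex = {
    'Questions per day': 0,
    'Answer rate': 1,
    'Users200': 2,
    'Users2k': 3,
    'Users3k': 4,
    'Answers per question': 5,
    'Visits per day': 6,
}

# Dataset name -> whether its individual graph is enabled
flagsByName = {
    'Questions per day': True,
    'Answer rate': False,
    'Users200': False,
    'Users2k': False,
    'Users3k': False,
    'Answers per question': False,
    'Visits per day': True,
}

# Indices of the enabled graphs, looked up by name rather than by position
enabled = [dataIndex[name] for name, flag in flagsByName.items() if flag]
print(enabled)  # [0, 6]
```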

DataFrame

If the CSV data can be read as a CSV using pandas.read_csv(), then more meaningful plots can be easily generated [14]. For example, a pie chart showing the proportion of users, or a pie chart that shows how the proportion varies over time. However, .read_csv() expects a file path, URL or file-like object that holds plain CSV, which we don’t have: the CSV is buried in an SE meta page, with the dates interleaved with the data. So the DataFrame is created from a List, or Dictionary [15], instead.

Currently, as the data is stored, the DataFrame is horizontal:

def createDataFrameHorizontal(dates, data, headings):
    import pandas as pd

    df = pd.DataFrame(data, columns=dates, index=headings)
    print(f'df:\n{df}')

which produces:

df:
                      2017-03-17  2018-05-25  ...  2021-01-21  2021-04-11
Questions per day            2.1         1.9  ...         3.9         2.8
Answer rate                 96.0        93.0  ...        88.0        88.0
200+ reputation             56.0       103.0  ...       351.0       358.0
2,000+ reputation            4.0         8.0  ...        27.0        27.0
3,000+ reputation            3.0         4.0  ...        14.0        14.0
Answers per question         2.0         1.9  ...         1.9         1.9
Visits per day             753.0         4.0  ...      8756.0      7146.0

[7 rows x 13 columns]

However, if we want a vertical DataFrame, then, the data need to be re-arranged from straight pure data aligned lists:

[[Q/D1,Q/D2,...Q/Dn],[AR1, AR2, ..., ARn], ..., [V/D1, V/D2, ..., V/Dn]]

to date aligned mixed lists:

[[Q/D1, AR1, ... V/D1], [Q/D2, AR2, ... V/D2], ..., [Q/Dn, ARn, ... V/Dn]]

Using

def interspliceData(data):
    """Mix the data up - one row is all the data for one date"""
    splicedData=[]
    for n in range(len(data[0])):
        splicedRow = []
        for row in data:
            splicedRow.append(row[n])
        print(f'splicedRow: {splicedRow}')
        splicedData.append(splicedRow)
    print(f'splicedData: {splicedData}')

    return splicedData
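The same re-arrangement is a transpose, so it can also be written with the built-in zip(). This is a shorter equivalent sketch, using the first three dates of three of the rows shown earlier:

```python
data = [[2.1, 1.9, 1.6],  # Questions per day
        [96, 93, 95],     # Answer rate
        [753, 4, 2324]]   # Visits per day

# zip(*data) yields one tuple per date, holding that date's value from every row
splicedData = [list(column) for column in zip(*data)]
print(splicedData)
# [[2.1, 96, 753], [1.9, 93, 4], [1.6, 95, 2324]]
```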

Followed by

def createDataFrameVertical(dates, data, headings):
    import pandas as pd

    df = pd.DataFrame(data, columns=headings, index=dates)
    print(f'df:\n{df}')

Now we get:

df:
            Questions per day  ...  Visits per day
2017-03-17                2.1  ...             753
2018-05-25                1.9  ...               4
2018-07-05                1.6  ...            2324
2018-07-07                2.1  ...            2648
2018-08-15                2.7  ...            2675
2018-09-03                2.1  ...            2774
2018-10-15                1.7  ...            2844
2018-11-06                2.0  ...            3041
2019-03-27                2.4  ...            3707
2019-09-05                3.0  ...            2934
2019-11-19                2.5  ...            3290
2021-01-21                3.9  ...            8756
2021-04-11                2.8  ...            7146

[13 rows x 7 columns]

From these DataFrames, graphs can be easily formed.

(Not yet implemented)


No legend for single graph

There is no need for a legend for an individual graph as the name of the data set is in the title. However how can we make use of the plotAllGraphs(), which does try to print a label – it has a label argument in the call to plt.plot():

plt.plot(x, y[i], 'b-^', linewidth=3, markersize=6, label=heading[i])

and yet not plot a label for individual graphs..? If we pass an empty string, then no legend is displayed:

# Using "" prevent legend being printed but also results in a red line warning
plotAllGraphs(dates, [data[0]], [""], headings[0])

However, it gives a console warning in red

No handles with labels found to put in legend.
No handles with labels found to put in legend.

But using double quotes is the method employed in bigcats ratio, without a warning being produced

big_cat_totals.plot(kind="pie", label="")

However, the legend re-appeared, since the change to using generic index called plots (see above) for individual graphs. Fixed by changing:

plotAllGraphsAx(dates, [data[index]], [headings[index]], ax,
                   title=headings[index])

to

plotAllGraphsAx(dates, [data[index]], [''], ax, title=headings[index])

Likewise for plotAllGraphsAxHead(), plotAllGraphsStackAx() and plotAllGraphsStackAxHead()
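One way that might avoid the "No handles with labels found" warning altogether is to ask the axes whether any labelled lines exist before calling legend(). This is only a sketch: addLegendIfLabelled() is a hypothetical helper, not part of the script.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt


def addLegendIfLabelled(ax):
    """Hypothetical helper: add a legend only when at least one line has a usable label"""
    # Lines with an empty label (or one starting with '_') are not returned here
    handles, labels = ax.get_legend_handles_labels()
    if handles:
        return ax.legend(handles, labels)
    return None


fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2.1, 1.9, 1.6], label='')  # single graph: empty label
legendSingle = addLegendIfLabelled(ax)         # no legend, and no warning

ax.plot([1, 2, 3], [2.0, 1.9, 1.9], label='Answers per question')
legendCombined = addLegendIfLabelled(ax)       # legend with one entry
```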

Colour cycling

Have a list of line graph colours, and manually cycle through them, using an index:

graphColours = ['b', 'r', 'g', 'y', ...]

or even easier just use colour cycling [16], using cycler:

prop_cycle=(cycler('color', ['r', 'g', 'b', 'y']) + cycler('linestyle', ['-', '--', ':', '-.']))

Using cycler() as an argument to rc():

from cycler import cycler

graphColours = ['r', 'g', 'b', 'y', 'c', 'm', 'k']
graphLines = ['-', '--', ':', '-.']
plt.rc('lines', linewidth=4)
plt.rc('axes', prop_cycle=(cycler('color', graphColours) +
                           cycler('linestyle', graphLines)))

However, this resulted in an error as the line and colour lists were not the same dimension:

ValueError: Can only add equal length cycles, not 7 and 4

Unfortunately only four short hand line definitions exist (https://matplotlib.org/stable/gallery/lines_bars_and_markers/linestyles.html), so in order to keep the number of colours, the line sequence was partially repeated:

    # graphColours = ['r', 'g', 'b', 'y']
    # graphLines = ['-', '--', ':', '-.']
    graphColours = ['r', 'g', 'b', 'y', 'c', 'm', 'k']
    graphLines = ['-', '--', ':', '-.', '--', ':', '-.']

However, for 7 datasets the 7 colours alone are sufficient to tell the lines apart. Note that adding cyclers with + pairs each colour with one line style, so both advance together. To cycle through all of the colours first, and only then change the line style and repeat the colours, the cyclers would have to be multiplied (*) instead; in that case 3 colours by 3 lines would be sufficient for 7 datasets. It is worth noting that the colours may not need specifying, as the first example (here) shows.
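The difference between adding and multiplying cyclers can be seen in this small sketch, which uses shortened colour and line lists for illustration:

```python
from cycler import cycler

graphColours = ['r', 'g', 'b']
graphLines = ['-', '--', ':']

# Addition pairs the two cycles element by element (both must be of equal length)
paired = cycler('color', graphColours) + cycler('linestyle', graphLines)

# Multiplication forms every combination: 3 line styles x 3 colours = 9 styles,
# with the right-hand cycle (the colours) varying fastest
combined = cycler('linestyle', graphLines) * cycler('color', graphColours)

print(len(paired))    # 3
print(len(combined))  # 9

styles = list(combined)
print(styles[0])  # solid red
print(styles[3])  # dashed red: the colours repeat once the line style changes
```

Either form can be passed to plt.rc('axes', prop_cycle=...) in the same way as in the snippet above.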

Changing the size of plots that don’t use subplot

The first sets of graphs didn’t use fig, ax = plt.subplots(),

fig, ax = plt.subplots(figsize=graphSize)

but just plt.<function>(), so the size is set using plt.figure(figsize=...). See https://stackoverflow.com/q/332289/4424636

Placement of plt.figure(figsize=...)

plt.figure(figsize=...) needs to be at the start of the sequence of plt commands, otherwise a second, empty “shadow” figure is created:

    plt.figure(figsize=graphSize)
    plt.grid(True)
    plt.xlabel('Date')
    plt.ylabel(title)
    plt.title(title)
    # plt.figure(figsize=graphSize)

However, if it is at the end then the line cycler is used, whereas when it is at the top the lines are all solid.

Stacked array plot

Stacked plots only make sense for the combined user reputation data. However, before a combination can be plotted, some pre-processing of the user reputation data is required, because the number of users with greater than 200 rep also contains users with greater than 2k rep and users with greater than 3k rep, and likewise the number of users with greater than 2k rep also contains users with greater than 3k rep.

Plotting the figures as they stand (without adjustment) will give incorrect representations of the plots for users with reputations greater than 200 and 2k, although the overall total plot and the plot of users with greater than 3k rep will still be correct.

So, stripping out the “pure” values:

def getPureUsers(usersData):
    """Get the pure numbers of the users with rep of: 200-1999; 2000-2999; and 3000+"""
    logit(f'Entering {thisFunctionName()}()')

    usersData2C_pure = []
    usersData2k_pure = []
    usersData3k_pure = []

    for i in range(len(usersData)):
        users2C = usersData[i][0]
        users2k = usersData[i][1]
        users3k = usersData[i][2]
        users2C_pure = users2C - users2k
        users2k_pure = users2k - users3k
        users3k_pure = users3k
        usersData2C_pure.append(users2C_pure)
        usersData2k_pure.append(users2k_pure)
        usersData3k_pure.append(users3k_pure)
    usersData_pure = [[usersData2C_pure], [usersData2k_pure], [usersData3k_pure]]
    return usersData_pure

And call with

def plotAllUsersPureGraphsStack(dates, data, headings):
    """Combined graph - Plots pure 200, 2k and 3k user rep graphs - stack graph - one set of axes"""
    logit(f'Entering {thisFunctionName()}()')

    data3DP_nodates_allUsers = data[2:5]
    data3DP_nodates_allUsersPure = getPureUsers(data3DP_nodates_allUsers)
    headings_allUsers = headings[2:5]
    plotAllGraphsStack(dates, data3DP_nodates_allUsersPure, headings_allUsers)
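
As a standalone check of the arithmetic, here is a minimal sketch that assumes the data is held as three equal-length series, one per reputation threshold (the script's own layout may differ). The function name pure_user_counts is hypothetical; the two snapshots are the first and the latest values from the Markdown lists further down this post:

```python
def pure_user_counts(users_200, users_2k, users_3k):
    """Convert cumulative user counts into exclusive reputation bands."""
    band_200 = [a - b for a, b in zip(users_200, users_2k)]  # 200-1999
    band_2k = [b - c for b, c in zip(users_2k, users_3k)]    # 2000-2999
    band_3k = list(users_3k)                                 # 3000+
    return band_200, band_2k, band_3k


# First and latest snapshots: 200+, 2k+ and 3k+ users
band_200, band_2k, band_3k = pure_user_counts([56, 358], [4, 27], [3, 14])
print(band_200, band_2k, band_3k)  # [52, 331] [1, 13] [3, 14]
```

The three bands add back up to the original 200+ figure (52 + 1 + 3 = 56 and 331 + 13 + 14 = 358), which is what makes the stacked total correct.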

 

Writing to file

The plot may be saved to disk as either a PDF or a PNG. In the case of the latter, a DPI setting needs to be provided:

flagWriteGraph = False     # Save generated graphs
flagWritePDFNotPNG = True  # True: PDF, False: PNG
numDPI = 200               # DPI setting for PNG

Each graph type is saved using a unique descriptive name, describing that graph style (i.e. the particular plotting function called). There is no option to prompt for, or change, the file name. Originally, existing saved files were overwritten; all filenames are now timestamped [18] to prevent this.

If it is expected that the same graph will be plotted and saved more than once within the same second, then the timestamp can be flagged to include microseconds:

flagFileTime = True        # Use filename timestamp
flagAccurateTime = False   # Use microseconds in filename timestamp

The filenames are stored in a dictionary, with the key being the function name.

strSavedFilePath = 'Saved_Graphs'

# Dictionary: functionName --> filename
dictFileNames = {'plotParasite3Graph': 'pyplot_multiple(3)_y-axis', 'plotParasite4Graph': 'pyplot_multiple(4)_y-axis'}

If no corresponding function name is found in the dictionary then the file is not saved and a warning is given.

The generic command to call savePlot() is:

 savePlot(dictFileNames.get(thisFunctionName()))

Use .get() to look up the filename in the dictionary, as it returns None if no corresponding entry is found [20]; appropriate action (i.e. a warning message and skipping the file save) can then be taken if None is returned.

def savePlot(strName):
    """Save the current plotted graph"""

    # if strName != None:
    if strName is not None:
        if flagFileTime:
            if flagAccurateTime:
                strNow = datetime.datetime.now().strftime('%Y%m%d%H%M%S%f')
            else:
                strNow = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
            fileName = strName+strNow
        else:
            fileName = strName

        fileName = strSavedFilePath+'/' + fileName

        if flagWriteGraph:
            if flagWritePDFNotPNG:
                # Best for professional typesetting, e.g. LaTeX
                plt.savefig(fileName+".pdf")
            else:
                # For raster graphics use the dpi argument. E.g. '[...].png", dpi=200)'
                plt.savefig(fileName+".png", dpi=numDPI)
    else:
        # functionName has no corresponding filename
        print("WARNING: No known filename to save graph - ignoring!!!!!")

Note that utcnow should not be used [19]; use now() with an explicit timezone instead:

# strNow = datetime.datetime.utcnow().strftime('%Y%m%d%H%M%S')  # https://stackoverflow.com/a/62762927/4424636
# Do not use utcnow, use now - https://stackoverflow.com/a/62762984/4424636
strNow = datetime.datetime.now(tz=datetime.timezone.utc).strftime('%Y%m%d%H%M%S')
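
A minimal sketch of how the two timestamp formats could be wrapped in one helper. The name make_timestamp and its now parameter are hypothetical (the parameter is only there so the output can be checked against a fixed date):

```python
import datetime


def make_timestamp(accurate=False, now=None):
    """Return a UTC timestamp string suitable for use as a filename suffix."""
    if now is None:
        # Timezone-aware "now", rather than the naive utcnow()
        now = datetime.datetime.now(tz=datetime.timezone.utc)
    fmt = '%Y%m%d%H%M%S%f' if accurate else '%Y%m%d%H%M%S'
    return now.strftime(fmt)


fixed = datetime.datetime(2021, 3, 4, 5, 6, 7, 89, tzinfo=datetime.timezone.utc)
print(make_timestamp(now=fixed))                 # 20210304050607
print(make_timestamp(accurate=True, now=fixed))  # 20210304050607000089
```

Both formats are fixed width and zero padded, so the resulting filenames also sort in date order.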

Graph function reorganisation

Prior to v5, the graphs were plotted by a series of nested function calls: the first function prepared the data for the particular dataset to be plotted, and then called a generic plotting function (stacked, line, etc.).

Splitting graph plots

In order to accommodate the multiple graphs in one fig, it is necessary to split out the actual plotting from the creation of the fig and axes. The graph plotting functions are split into two, with suffixes:

  • ...Ax() – just plots the graph, and is passed the axes context
  • ...AxHead() – creates (and prepares) the axes and fig contexts in which the graphs are plotted. Also sets the layout, saves the plot and shows the plot

...AxHead() replaces the previous graph plotting function.

Note: for plotGraph() and plotAllGraphs(), this doesn’t really make a lot of sense, as there are currently no fig or ax objects.

Renaming the calling functions

The calling functions prepare the dataset prior to the graph plotting function being called. These are not new. However, they are all prefixed with plot...(). So, to avoid confusion with the graph plotting functions, these are renamed as selectPlot...()

 

Config files

Put all of the parameters that can be set by the user into a text file. See https://stackoverflow.com/q/924700/4424636

Formats:

Using config parser

Note, in Python 3 you have to use configparser (lower case) and not ConfigParser, which was the Python 2 module name.

However, the variable names are all converted to lower case and therefore do not match the actual variable names. See https://stackoverflow.com/a/19359720/4424636 for explanation.

To fix, you need to add cfg.optionxform = str (see configparser – ConfigParser.RawConfigParser.optionxform):

cfg = configparser.RawConfigParser()
cfg.optionxform = str
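
The effect of optionxform can be seen in a small self-contained sketch. The section name flags is an assumption made for the example; the option names are two of the script's parameters:

```python
import configparser

INI_TEXT = """
[flags]
flagWriteGraph = False
numDPI = 200
"""

# Default behaviour: option names are converted to lower case
default_cfg = configparser.RawConfigParser()
default_cfg.read_string(INI_TEXT)

# With optionxform = str the names are kept exactly as written
case_cfg = configparser.RawConfigParser()
case_cfg.optionxform = str
case_cfg.read_string(INI_TEXT)

print(default_cfg.options('flags'))  # ['flagwritegraph', 'numdpi']
print(case_cfg.options('flags'))     # ['flagWriteGraph', 'numDPI']
```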

Then the flags that were previously type bool are now returned as str. The fix would be this answer but requires wholesale variable name change, although this would be restricted only to the cfg file:

[Section]
s_path = D:\
f_number = 10.0
b_boolean = False

...

def type_convert(items):
    result = []
    for (key, value) in items:
        type_tag = key[:2]
        if type_tag == "s_":
            result.append((key[2:], value))
        elif type_tag == "f_":
            result.append((key[2:], float(value)))
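        elif type_tag == "b_":
            # bool("False") is True (any non-empty string is), so compare the text instead
            result.append((key[2:], value.strip().lower() in ("1", "yes", "true", "on")))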
        else:
            raise ValueError('Invalid type tag "%s" found in ini file.' % type_tag)
            # alternatively: "everything else defaults to string"
    return result

...

self.__dict__.update(dict(type_convert(parser.items("Section"))))

So using the following prefixes:

  • b_ for Boolean
  • s_ for string (there are none)
  • f_ for float (there are none)
  • i_ for int
  • t_ for tuple of int
  • l_ for list

However, there is an issue with lists: how to differentiate a list of ints from a list of strings? Use ls_ or li_? It is possible but starts to get overly complex.

Other options are here, Converting ConfigParser values to python data types.

Wouldn’t it be easier to just read the type before and then set the type again after? Or use the long-winded method for each parameter in the config file and use getboolean(), getint(), etc.?

Or use a JSON file, see this answer to Use ConfigParser with different variable types in Python.
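
The long-winded route with the built-in typed getters could look like the sketch below. The section name saving is an assumption for the example; the three option names are the script's own save parameters:

```python
import configparser

cfg = configparser.RawConfigParser()
cfg.optionxform = str  # keep the variable names' case
cfg.read_string("""
[saving]
flagWriteGraph = False
flagWritePDFNotPNG = True
numDPI = 200
""")

# One typed getter per parameter
flagWriteGraph = cfg.getboolean('saving', 'flagWriteGraph')
flagWritePDFNotPNG = cfg.getboolean('saving', 'flagWritePDFNotPNG')
numDPI = cfg.getint('saving', 'numDPI')

# For comparison: a plain get() returns a str, and bool() of any
# non-empty string is True, even for the text 'False'
naive = bool(cfg.get('saving', 'flagWriteGraph'))

print(flagWriteGraph, flagWritePDFNotPNG, numDPI, naive)  # False True 200 True
```

It is one line per parameter, but there are no prefixes to maintain and the variable names in the config file stay identical to those in the script.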

Doing it manually

Lists

Functions to read int and boolean values are built in. However, for a list of ints, see this answer for splitting on newlines, and this answer for splitting on commas (both answers to Lists in ConfigParser). Some stripping is also needed: first the brackets, then spaces and then quotes, in the case of strings:

cfgGraphColours = [i.strip().strip("'") for i in config.get("graphing_params", "cfgGraphColours").strip('[]').split(',')]  # This is a list of strings

And, in the case of ints, first the brackets and then the int conversion (int() ignores the surrounding spaces):

flagThreePlotDataSet = [int(x) for x in config.get('multiple_y_axes', 'flagThreePlotDataSet').strip('[]').split(',')]

Tuples

The tuple (for figsize) required manually recreating the tuple from a list:

cfgGraphSize_int = [i.strip() for i in config.get("graphing_params", "cfgGraphSize").strip('()').split(',')]  # Make a list first
cfgGraphSize = (int(cfgGraphSize_int[0]), int(cfgGraphSize_int[1]))  # Then create the tuple

See also Storing and retrieving a list of Tuples using ConfigParser – although I can’t remember if it was useful (in this case)
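
Putting the list and tuple handling together, a self-contained sketch might look like this. The helper parse_items is hypothetical, and the values in the config text are only examples (the figure size matches the graphSize default shown earlier):

```python
import configparser

config = configparser.RawConfigParser()
config.optionxform = str
config.read_string("""
[graphing_params]
cfgGraphColours = ['r', 'g', 'b']
cfgGraphSize = (8, 5)

[multiple_y_axes]
flagThreePlotDataSet = [0, 1, 4]
""")


def parse_items(raw, brackets):
    """Strip the enclosing brackets, split on commas, then tidy each item."""
    return [item.strip().strip("'") for item in raw.strip(brackets).split(',')]


# List of strings
colours = parse_items(config.get('graphing_params', 'cfgGraphColours'), '[]')
# List of ints
dataset = [int(x) for x in
           parse_items(config.get('multiple_y_axes', 'flagThreePlotDataSet'), '[]')]
# Tuple of ints, rebuilt from the intermediate list
size = tuple(int(x) for x in
             parse_items(config.get('graphing_params', 'cfgGraphSize'), '()'))

print(colours, dataset, size)  # ['r', 'g', 'b'] [0, 1, 4] (8, 5)
```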

File checks

Finally checks were added to ensure that the directory and file exist, from Python Check If File or Directory Exists:

How to check If File Exists

  • os.path.exists() – Returns True if the file or directory exists.
  • os.path.isfile() – Returns True if the path is a file.
  • os.path.isdir() – Returns True if the path is a directory.
  • pathlib.Path.exists() – Returns True if the file or directory exists (Python 3.4 and above).
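
The checks listed above can be tried out safely in a temporary directory. In this sketch the file name settings.cfg is hypothetical; Saved_Graphs is the directory name used for strSavedFilePath:

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    graph_dir = os.path.join(tmp, 'Saved_Graphs')
    cfg_file = os.path.join(tmp, 'settings.cfg')  # hypothetical file name

    os.makedirs(graph_dir, exist_ok=True)  # create the directory if missing
    Path(cfg_file).write_text('[saving]\nnumDPI = 200\n')

    checks = {
        'directory found': os.path.isdir(graph_dir),
        'file found': os.path.isfile(cfg_file),
        'directory is a file': os.path.isfile(graph_dir),
        'missing path found': os.path.exists(os.path.join(tmp, 'missing.cfg')),
        'pathlib agrees': Path(cfg_file).exists(),
    }

print(checks)
```

Note that os.path.exists() alone cannot tell a file from a directory, which is why isfile() and isdir() are the more useful checks here.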

Accessing modules in a Subdirectory:

TUI – Text User Interface

RC

Moved the rc commands in plotAllGraphs() to a separate function as they interfered with the other graphs. Made linewidth and markersize both definable, centralised and constant.

However, the rc settings don’t seem to work on the ax. plots, only plt. plots.

Area51 scraper

  • Check if we already have a local copy of the CSV
  • Web scrape meta page for CSV if no local copy
  • Check we don’t already have today’s data
  • Scrape Area51 page
  • Add/append new data and date
  • Save CSV list as pickle and string as text
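
The "today's data" check and the append step from the list above might be sketched as follows. The row layout (an ISO date string followed by the values), the function name append_if_new and the sample figures are all assumptions for illustration, not the scraper's actual format:

```python
import datetime


def append_if_new(rows, values, today=None):
    """Append [date, *values] unless the last row already carries today's date."""
    today = today or datetime.date.today().isoformat()
    if rows and rows[-1][0] == today:
        return False  # already have today's data, nothing to do
    rows.append([today, *values])
    return True


rows = [['2021-01-01', 2.8, 88]]
print(append_if_new(rows, [2.9, 89], today='2021-01-01'))  # False
print(append_if_new(rows, [2.9, 89], today='2021-01-08'))  # True
print(rows[-1])                                            # ['2021-01-08', 2.9, 89]
```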

For reading the last timestamped file, see Reading files in a particular order in python and Download the latest file according to timestamp in file name from SFTP server
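
If the files use the fixed-width %Y%m%d%H%M%S timestamp in their names, the newest one can be picked by plain string comparison. A minimal sketch, where the function name and the area51_ filenames are hypothetical:

```python
def latest_timestamped(filenames, prefix):
    """Return the newest file name with the given prefix, or None if there is none."""
    candidates = [name for name in filenames if name.startswith(prefix)]
    # Zero-padded, fixed-width timestamps sort alphabetically in date order
    return max(candidates) if candidates else None


names = [
    'area51_20210101120000.csv',
    'area51_20210301120000.csv',
    'area51_20210201120000.csv',
    'notes.txt',
]
newest = latest_timestamped(names, 'area51_')
print(newest)  # area51_20210301120000.csv
```

In the script itself the list of names could come from os.listdir() on the save directory.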

Markdown comparison

The original “manual” markdown:

 - *Questions per day* <strike>**2.1**</strike> -> <strike>1.9</strike> <strike>1.6</strike> <strike>2.1</strike> <strike>2.7</strike> <strike>2.1</strike> <strike>1.7</strike> <strike>2</strike> <strike>2.4</strike> <strike>3.0</strike> <strike>2.5</strike> <strike>3.9</strike> 2.8
 - *Answer rate* <strike>**96&nbsp;%**</strike> -> <strike>93&nbsp;%</strike> <strike>95&nbsp;%</strike> <strike>96&nbsp;%</strike> <strike>97&nbsp;%</strike> <strike>98&nbsp;%</strike> <strike>96&nbsp;%</strike> <strike>95&nbsp;%</strike> <strike>94&nbsp;%</strike> 88&nbsp;%
 - *Users*
  - *200+ reputation* <strike>**56**/150</strike> -> <strike>103/150</strike> <strike>113/150</strike> <strike>139/150</strike> <strike>144/150</strike> <strike>151/150</strike> <strike>161/150</strike> <strike>164/150</strike> <strike>179/150</strike> <strike>194/150</strike> <strike>282/150</strike><sup>*</sup> <strike>351/150</strike> 358/150
  - *2,000+ reputation* <strike>**4**/10</strike> -> <strike>8/10</strike> <strike>9/10</strike> <strike>10/10</strike> <strike>11/10</strike> <strike>12/10</strike> <strike>14/10</strike> <strike>17/10</strike> <strike>19/10</strike> <strike>22/10<sup>*</sup></strike> 27/10 
  - *3,000+ reputation* <strike>**3**/5</strike> -> <strike>4/5</strike> <strike>6/5</strike> <strike>7/5</strike> <strike>8/5</strike> <strike>9/5</strike> <strike>11/5</strike> <strike>12/5<sup>*</sup></strike> 14/5
 - *Answers per question* ratio is <strike>**2.0**</strike> -> 1.9
 - *Visits per day* <strike>**753**</strike> -> <strike>4</strike> <strike>2,324</strike> <strike>2648</strike> <strike>2675</strike> <strike>2774</strike> <strike>2844</strike> <strike>3041</strike> <strike>3707</strike> <strike>2934</strike> <strike>3290</strike> <strike>8756</strike> 7146

The new scripted markdown:

 - *Questions per day* <strike>**2.1**</strike> -> <strike>1.9</strike> <strike>1.6</strike> <strike>2.1</strike> <strike>2.7</strike> <strike>2.1</strike> <strike>1.7</strike> <strike>2</strike> <strike>2.4</strike> <strike>3.0</strike> <strike>2.5</strike> <strike>3.9</strike> 2.8
 - *Answer rate* <strike>**96&nbsp;%**</strike> -> <strike>93&nbsp;%</strike> <strike>95&nbsp;%</strike> <strike>96&nbsp;%</strike> <strike>97&nbsp;%</strike> <strike>98&nbsp;%</strike> <strike>96&nbsp;%</strike> <strike>95&nbsp;%</strike> <strike>94&nbsp;%</strike> 88&nbsp;%
 - *Users*
   - *200+ reputation* <strike>**56/150**</strike> -> <strike>103/150</strike> <strike>113/150</strike> <strike>139/150</strike> <strike>144/150</strike> <strike>151/150</strike> <strike>161/150</strike> <strike>164/150</strike> <strike>179/150</strike> <strike>194/150</strike> <strike>282/150</strike><sup>*</sup> <strike>351/150</strike> 358/150
   - *2,000+ reputation* <strike>**4/10**</strike> -> <strike>8/10</strike> <strike>9/10</strike> <strike>10/10</strike> <strike>11/10</strike> <strike>12/10</strike> <strike>14/10</strike> <strike>17/10</strike> <strike>19/10</strike> <strike>22/10</strike><sup>*</sup> 27/10
   - *3,000+ reputation* <strike>**3/5**</strike> -> <strike>4/5</strike> <strike>6/5</strike> <strike>7/5</strike> <strike>8/5</strike> <strike>9/5</strike> <strike>11/5</strike> <strike>12/5</strike><sup>*</sup> 14/5
 - *Answers per question* ratio is <strike>**2.0**</strike> -> 1.9
 - *Visits per day* <strike>**753**</strike> -> <strike>4</strike> <strike>2324</strike> <strike>2648</strike> <strike>2675</strike> <strike>2774</strike> <strike>2844</strike> <strike>3041</strike> <strike>3707</strike> <strike>2934</strike> <strike>3290</strike> <strike>8756</strike> 7146

Rendered output

Original

New

Differences from original markdown list

  • Missing indentation for the user reputation sub-list fixed (one additional space was required – two spaces to indent, previously there was only one)
  • The asterisk, at the user reputation change, was struck out for the 2k and 3k users; now it is not
  • The numerator only was in bold, now the whole fraction is in bold
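
The pattern in the scripted Markdown (first value in bold and struck out, followed by an arrow, then every value but the latest struck out) can be sketched as below. The function name strike_line is hypothetical, and the sketch ignores the asterisk markers and the sub-list indentation:

```python
def strike_line(label, values):
    """Build one Markdown list line from the history of a statistic."""
    if len(values) < 2:
        raise ValueError('need the initial value and at least one later value')
    first, *rest = values
    parts = [f'<strike>**{first}**</strike> ->']            # initial value, bold
    parts += [f'<strike>{v}</strike>' for v in rest[:-1]]   # superseded values
    parts.append(str(rest[-1]))                             # latest value, unstruck
    return f' - *{label}* ' + ' '.join(parts)


line = strike_line('Questions per day', ['2.1', '1.9', '1.6', '2.8'])
print(line)
```

Passing a fraction such as 56/150 as the first value puts the whole fraction in bold, matching the third difference listed above.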

Still to do

  • No legend on single graphs – half done – warning is shown
  • Dual axes in one fig – done with fixed kludge
  • Use dual axes in one fig and pass ax to the graph plotting functions to plot inside (is this possible?) – done
  • Dual units on two y-axes (left and right) – done
  • Add parasite axis for combined four way plot – done
  • Choose which graphs for three way parasite – done
  • Different colours in multiple line plot – done
  • Use generic index plot for individual plots – half done – written but not called
  • Put flags into an array or list for cycling through – half done – written but not called
  • Regex data capture – done
  • Use beautiful soup for extraction – done
  • Use List for DataFrame – done
  • Use DataFrame to plot
  • Fix the stack graphs – done
  • Make all graphs the same size – the same as the parasite – use a pair of size variable parameters – half done
  • Add PDF save to all graphs
  • Add a time stamp to graph saves – done
  • If same scale (for QPD and APQ), then hide one Y-axis (parasite or otherwise) – Is it possible to hide a Y-axis while still keeping the label, i.e. use two labels on one Y-axis?
  • Fix colours on parasite – either replace current method, or find number for standard RGB (r, g, b, y, g, etc.)
  • Move imports to top – done
  • Filenames – parametrise in dictionary – half done – in process
  • All data logit() calls should use \n – done
  • Slanted dates – https://realpython.com/pandas-plot-python/#outliers
  • Added dates to 4 axis parasite
  • Title for 4 axis parasite – done
  • Title for 3 axis parasite
  • Fig/window title for dual plot
  • Add print to dual plot – what???? I think this meant save not print
  • Add save to dual plot – done
  • Titles for all
  • Split all graph plotting functions into fig and ax for modularity? Have …Ax and …AxHead – half done: plotGraph and plotAllGraphs not done, as they have no ax context
  • Make plotGraph and plotAllGraphs have a fig and ax context
  • Delete plotGraphList() – done
  • Actually call the generic stack and line plotters and… – done
  • Delete the individual plotters – done
  • All plotting functions are called plot...() , even those that don’t actually plot but just call plot functions. Need to rename the plot callers as callPlot...()
  • Lose the legend again for single plots – it has reappeared
  • 4Y plot seems too large to fit in the figsize, or the tight_layout isn’t working – check with the old version.
  • Lines seem to be thicker now (especially on the 4Y), since stacked (?) is not set. Checked, plot lines are thicker with stacked not set (is it the rc being set – if so, then need to save the previous method)
  • rc settings do not work for ax, maybe use rcParams? https://stackoverflow.com/a/41717533/4424636
  • fix config file, variable types
  • ‘ha=’ setting not working
  • fig size use rc or rcParams?
  • fig size – is there a class, i.e. standard (2Y), 3Y and 4Y for the width? or base it on the number of y axis => 2/3/4?

Notes/Conclusions

  • The answers per question is superfluous (as it is mostly a flat line) and just clutters the four way graph. It may be better as a standalone graph, immediately below the three way parasite graph, so that the two can be easily compared if required.
  • The three way graph of questions per day, answer rate and visits per day is the most interesting. Even so, the answer rate could also be considered superfluous.
  • The visits per day strongly tracks the questions per day
  • Using the same scale for the questions per day and the answers per question just flattens the answers per question. Autoscaling for all axes (apart from, maybe, the Answer Rate) on the 4 way parasite seems to produce the most aesthetically pleasing results.
  • The Answer Rate graph is flattened when shown on a 0-100 scale, but this doesn’t really matter and the curves can still be seen.

Finally, a UI would be useful for setting the flags/options: Python Terminal/Text UI (TUI) library [closed]

Random history (Totally unrelated)

This answer to Why does std::bit_width return 0 for the value 0, shouldn’t it return 1?

Matplotlib courses

https://realpython.com/search?q=matplotlib

Links for issues

Same axes

  • dd

Subplot size

Parasite

Function comparison

 

 
