Building a Blog with Pandoc and Git

I think one of the most important points of a blog-type website is keeping track of dates, but a writer does not necessarily update the dates in the files to reflect the actual publish or edit date. On the other hand, keeping a record of when you introduced a change, and what you actually changed, is an important piece of the puzzle if you are planning to revisit your writing or share it for any kind of review.

Nowadays, we have tools like WordPress, which are classified as Content Management Systems (and, by the way, WordPress predates git).

Most of the time, WordPress is the go-to solution for blogs. Moving forward in time, if you were tech-savvy, you could use Emacs Org mode to build websites, and since the site was a reflection of your Org mode notes, it made sense to use it to publish a blog. Even further in time, we got tools such as Gatsby and Hugo, which assist in building websites based on text file content.

Now in this entry I’m proposing the use of pandoc to build websites.

One issue we might have with this approach is that pandoc only converts single documents from one format to another, but in combination with make it builds a full website really nicely. (In fact, my website is built this way.)

In this entry I won’t go into the details of my implementation; rather, I will explain the core ideas behind it.

Let us assume we have a directory structure as follows:

.
├── Makefile
├── metadata.yaml
├── .git/
├── templates/
├── filters/
├── src
│   ├── assets
│   │   ├── styles.css
│   │   ├── script.js
│   │   └── thumbnail.jpg
│   ├── contact.md
│   └── index.md
└── build/

The .git directory in the previous file tree indicates that we are under version control, and build is the directory where we store the output of the site build (it should actually be listed in .gitignore).

I am ignoring the blog portion in this directory tree.

Now let’s talk about the Makefile content. If you are not familiar with make, it’s a program originally used for compilation, and if you think about it, we are kind of compiling the website. The analogy goes as follows:

- pandoc as gcc
- Markdown files as source files
- metadata.yaml as the configuration file
- Assets as just assets for the project
- Filters and templates as configuration for the compiler and for a given file

So the content will look something like the following:

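SRC_DIR := src
BUILD_DIR := build
PANDOC := pandoc
# Lua filters are loaded with --lua-filter; --filter runs external programs.
PANDOC_FILTERS := \
    --lua-filter filters/filter_A.lua \
    --filter filters/filter_B.py

# Recursively expanded (=) so that $@ and $< are resolved inside each recipe.
# git applies -n before --reverse, so the first commit is taken with head.
PANDOC_METADATA = \
    --metadata-file metadata.yaml \
    --metadata=path="$(shell dirname $@ | sed 's/^$(BUILD_DIR)//')" \
    --metadata=git_initial_date="$(shell git log --reverse --pretty=format:%aD -- $< | head -n 1)" \
    --metadata=git_date="$(shell git log -n 1 --pretty=format:%aD -- $<)"

PANDOC_COMMAND = $(PANDOC) $(PANDOC_FILTERS) $(PANDOC_METADATA)

PAGES = # Define here which pages to publish.

all: $(PAGES)

# Recipe lines must start with a tab character.
# The input file is a positional argument (-i would mean --incremental).
$(BUILD_DIR)/%.html: $(SRC_DIR)/%.md
    mkdir -p $(shell dirname $@)
    $(PANDOC_COMMAND) $< -o $@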

The rule we have added takes a Markdown file, creates its path in the build directory, and then runs pandoc with the Markdown file as input to produce an HTML file.
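
The two git dates passed as metadata are the heart of this setup, so here is a minimal Python sketch of the same lookup, outside of make. The function names first_and_last and git_dates are hypothetical and only for illustration; the sketch assumes git is on the PATH and that it runs inside the repository.

```python
import subprocess
from email.utils import parsedate_to_datetime


def first_and_last(log_output):
    """Return (oldest, newest) dates from `git log --pretty=format:%aD` output.

    git log lists commits newest first, so the last line is the first commit.
    """
    lines = [line.strip() for line in log_output.splitlines() if line.strip()]
    if not lines:
        return None, None  # the file has no commits yet
    return parsedate_to_datetime(lines[-1]), parsedate_to_datetime(lines[0])


def git_dates(path):
    """Run git log for one file and return its (oldest, newest) author dates."""
    result = subprocess.run(
        ["git", "log", "--pretty=format:%aD", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return first_and_last(result.stdout)
```

Calling git_dates("src/index.md") would give the date the file was first committed and the date of its latest change, which correspond to the git_initial_date and git_date metadata values.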

To preview the result, we can use a simple HTTP server, such as the one Python provides, with the following command.

python -m http.server --directory build

Since the output is just files (not programs meant to be executed on the server to produce dynamic content), we can deploy it to an HTTP server like Apache or Nginx, or even to static storage such as an S3-like service.

To create the blog structure, we need a custom rule for the entries portion of the site. First, we have to decide which URL is going to be the canonical URL for any given entry; in other words, how you would like to reference the page when sharing or viewing it.

In this case, we do not want the .html extension in the URLs of our entries, so we can keep each entry in its own directory. Given a URL such as https://domain.com/example/, most HTTP servers will load https://domain.com/example/index.html by default, and we can use this to our advantage.

If we do not mind having the extension in the URL, the procedure is fairly similar.
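
To make the two URL styles concrete, here is a small Python sketch that maps a source file to the URL path it would be served under. The function canonical_url and its keep_extension parameter are hypothetical names for this illustration, and it assumes the sources live under src as in the tree above.

```python
from pathlib import PurePosixPath


def canonical_url(source, src_dir="src", keep_extension=False):
    """Map a Markdown source path to the URL path it is served under."""
    relative = PurePosixPath(source).relative_to(src_dir).with_suffix(".html")
    if relative.name == "index.html" and not keep_extension:
        # Servers load index.html for a directory URL, so we can drop it.
        parent = relative.parent.as_posix()
        return "/" if parent == "." else f"/{parent}/"
    return f"/{relative.as_posix()}"
```

With this, canonical_url("src/blog/first-entry/index.md") gives the directory-style URL, while passing keep_extension=True gives the variant that keeps index.html.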

Now let us create a directory named blog under src, where we will create one directory per entry.

.
├── metadata.yaml
└── src
    ├── assets/
    ├── contact.md
    ├── index.md
    └── blog
        ├── index.md
        ├── first-entry
        │   └── index.md
        └── example-entry
            ├── assets/
            └── index.md

In the tree above, we can see we have:

- Blog Index
- One entry with no assets
- One entry with assets

Since the blog directory is somewhat special, we need a rule meant for entries only, which ends up not being so different from the previous one:

COMMENT_SECTION := templates/comment-section.html
PANDOC_BLOG_FILTER := filters/blog_feed.py

# The assets check uses a POSIX test, since make runs recipes with /bin/sh
# (not bash) by default. Recipe lines must start with a tab character.
$(BUILD_DIR)/blog/%.html: $(SRC_DIR)/blog/%.md $(COMMENT_SECTION)
    mkdir -p $(shell dirname $@)
    if [ -d "$(shell dirname $<)/assets" ]; then \
        cp -r "$(shell dirname $<)/assets/." "$(shell dirname $@)/assets"; \
    else \
        echo "Entry with no assets"; \
    fi
    $(PANDOC_COMMAND) \
        --filter=$(PANDOC_BLOG_FILTER) \
        --include-after-body=$(COMMENT_SECTION) \
        $< -o $@

In this case, we are copying the assets directory (if it exists) and adding a comment section to the page.

If you do want comments, this can be handled by adding the required script (like a Disqus feed) to the template file templates/comment-section.html.

Now here comes the most important part of the blog: the feed. It is built from the pandoc metadata and the git dates, which a filter collects into a list (a CSV file).

The following script is a panflute filter that reads the metadata of your Markdown document and appends it as a row to a file named .entries.csv at the root of the project.

import csv

import panflute as pf


def action(elem, doc):
    # No per-element changes; this filter only collects metadata.
    pass


def finalize(doc):
    keys = [
        "git_initial_date",
        "git_date",
        "title",
        "subject",
        "path"
    ]
    rss_data = {}
    for k in keys:
        rss_data[k] = doc.get_metadata(k, "")
    # Append one row per entry. fieldnames must cover every key in rss_data,
    # otherwise DictWriter raises ValueError.
    with open(".entries.csv", "a", newline="") as file:
        writer = csv.DictWriter(file, fieldnames=keys)
        writer.writerow(rss_data)


def main(doc=None):
    return pf.run_filter(action, finalize=finalize, doc=doc)


if __name__ == "__main__":
    main()

Once every entry has been built, we will find the file .entries.csv. Given that it’s a CSV file, we can parse it to create the RSS XML and the feed list for our blog cover, by adding the following two rules to our Makefile.

ENTRIES_LIST := .entries.csv

$(BUILD_DIR)/blog/index.html: $(SRC_DIR)/blog/index.md $(ENTRIES_LIST)
    cat $(ENTRIES_LIST) | $(SORT) | $(CSV_2_HTML) | \
        $(PANDOC_COMMAND) $< - -o $@

$(BUILD_DIR)/blog/rss.xml: $(ENTRIES_LIST)
    cat $(ENTRIES_LIST) | $(SORT) | $(CSV_2_RSS) > $@
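
The $(SORT) step in these rules is left undefined. A plain text sort would not order dates in git's %aD format chronologically, so here is a hedged Python sketch of what that step could do. The name sort_entries and the column order in FIELDS are assumptions: rows are taken to hold the five keys the filter reads, in that order, with no header row. Because the filter appends on every build, the sketch also keeps only the latest row per path.

```python
import csv
from email.utils import parsedate_to_datetime

# Assumed column order: the five keys the filter reads, without a header row.
FIELDS = ["git_initial_date", "git_date", "title", "subject", "path"]


def sort_entries(lines):
    """Return one row per path, newest first by initial git date."""
    latest = {}
    for row in csv.DictReader(lines, fieldnames=FIELDS):
        if not row["git_initial_date"]:
            continue  # skip entries that were never committed
        latest[row["path"]] = row  # a later row for the same path wins
    return sorted(
        latest.values(),
        key=lambda row: parsedate_to_datetime(row["git_initial_date"]),
        reverse=True,
    )
```

To use it in the pipeline, a small script could pass sys.stdin to sort_entries and write the result back out with csv.DictWriter on sys.stdout.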

Let’s assume we have two programs, CSV_2_RSS and CSV_2_HTML, that already do the job of converting the CSV to the desired outputs; they should be fairly simple to write.
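
As a rough idea of what those two converters could look like, here is a Python sketch using only the standard library. The names read_rows, rows_to_html and rows_to_rss, as well as the site_url and feed_title parameters, are hypothetical. It assumes headerless rows with the five keys the filter reads, and a path value that starts with a slash and has no trailing slash (for example /blog/first-entry).

```python
import csv
from html import escape
from xml.etree import ElementTree as ET

# Assumed column order: the five keys the filter reads, without a header row.
FIELDS = ["git_initial_date", "git_date", "title", "subject", "path"]


def read_rows(lines):
    """Parse headerless CSV lines into dictionaries keyed by FIELDS."""
    return list(csv.DictReader(lines, fieldnames=FIELDS))


def rows_to_html(rows):
    """Render the entries as an HTML list linking to each entry directory."""
    items = [
        f'<li><a href="{escape(row["path"])}/">{escape(row["title"])}</a></li>'
        for row in rows
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


def rows_to_rss(rows, site_url, feed_title):
    """Render the entries as a minimal RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = feed_title
    for row in rows:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = row["title"]
        ET.SubElement(item, "link").text = site_url.rstrip("/") + row["path"] + "/"
        ET.SubElement(item, "description").text = row["subject"]
        # %aD is RFC 2822 style, which is the date format RSS expects.
        ET.SubElement(item, "pubDate").text = row["git_initial_date"]
    return ET.tostring(rss, encoding="unicode")
```

The HTML list can be piped straight into pandoc, since raw HTML inside Markdown input is passed through to HTML output.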

Assuming everything is working fine up to this point, you should now have a functional static site builder.

Remarks

- It’s recommended to work in one branch per entry, so we keep the master branch clean and can deploy a build from it without the noise of a dirty git status.
- Using make provides a lot of advantages. For instance, we usually want the files optimized for the web, and with make we can add rules that target images to compress them, or videos to encode them.
- Global navigation can be achieved by editing the default template you are using.
- A tagging system is also possible, by using a metadata value to carry keywords.
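
As a sketch of the tagging idea only: assuming each entry carries a keywords list in its metadata (a hypothetical field that the feed filter above does not collect), tag pages could be derived by grouping entry paths per keyword. The function name group_by_tag is also just for illustration.

```python
from collections import defaultdict


def group_by_tag(entries):
    """Map each keyword to the paths of the entries that carry it."""
    tags = defaultdict(list)
    for entry in entries:
        # "keywords" is an assumed metadata field holding a list of strings.
        for keyword in entry.get("keywords", []):
            tags[keyword.strip().lower()].append(entry["path"])
    return dict(tags)
```

Each key of the result could then become one tag page listing its entries.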

I think this setup is mostly about getting into the mindset of treating pandoc as a compiler instead of a document converter.

If you want to extend this setup, I have a proof of concept on GitHub right here.