Blog Writing Guide & Template
This guide explains how to write and publish a blog post on this site. It covers the required frontmatter fields, recommended section structure, markdown formatting reference, and a complete worked example.
1. Frontmatter Reference
Every blog post must begin with a YAML frontmatter block. Here is a complete example:
---
title: "Why RLHF Works Better Than You Think"
description: "A clear-eyed look at reinforcement learning from human feedback: what it actually optimises for, when it fails, and what comes next."
date: 2024-03-15
category: Reinforcement Learning
tags: [rlhf, alignment, llm, post-training]
featured: false
---
| Field | Type | Required | Notes |
|---|---|---|---|
| `title` | string | Yes | Keep under 80 characters. Use sentence case. |
| `description` | string | Yes | 1–2 sentences. Shown in list views and SEO meta. |
| `date` | `YYYY-MM-DD` | Yes | Publication date. Posts are sorted by this field. |
| `category` | string | Yes | A single category (e.g. Machine Learning, Career). |
| `tags` | string[] | Yes | 2–5 lowercase tags without spaces. Use hyphens. |
| `featured` | boolean | No | Set `true` to feature on the home page. Defaults to `false`. |
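The required fields above can be checked mechanically before publishing. As a minimal sketch (a naive line-by-line parser for illustration only; a real pipeline should use a proper YAML library), a frontmatter validator might look like:

```python
import re

# Required frontmatter fields from the table above
REQUIRED = {"title", "description", "date", "category", "tags"}

def parse_frontmatter(text: str) -> dict:
    """Extract the leading ----delimited block and parse simple `key: value` lines."""
    match = re.match(r"^---\n(.*?)\n---\n", text, re.DOTALL)
    if not match:
        return {}
    meta = {}
    for line in match.group(1).splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta

def validate_frontmatter(text: str) -> list[str]:
    """Return a list of problems with a post's frontmatter (empty list = valid)."""
    meta = parse_frontmatter(text)
    if not meta:
        return ["missing or empty frontmatter block"]
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - meta.keys())]
    if "date" in meta and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", meta["date"]):
        problems.append("date is not in YYYY-MM-DD format")
    return problems
```

For example, `validate_frontmatter` on a post missing `category` returns `["missing field: category"]`, which is easy to surface in a pre-commit hook.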
2. Recommended Section Structure
A well-structured blog post typically follows this pattern:
Introduction (hook + thesis)
- Open with a concrete question, claim, or observation that the reader will care about.
- State your main argument clearly within the first three paragraphs.
- Avoid preamble ("In this post I will…").
Body (argument + evidence)
- Organise into 3–6 sections with clear `##` headings.
- Each section should make one coherent point.
- Use examples, code snippets, or equations to ground abstract claims.
- Keep paragraphs short (3–5 sentences).
Conclusion (synthesis + takeaways)
- Summarise what you showed, not just what you said.
- End with a concrete implication, open question, or call to action.
Key Takeaways (optional but recommended)
A bullet-point summary for readers who skim:
## Key Takeaways
- RLHF optimises for reward model score, not for human preference directly.
- Reward hacking is a feature of the optimisation landscape, not a bug.
- Constitutional AI and process reward models address different failure modes.
3. Markdown Formatting Reference
Headings
## Section Heading (use for major sections)
### Subsection Heading (use for sub-points)
#### Minor heading (use sparingly)
Emphasis
**bold** for key terms on first use or critical claims.
*italic* for titles, foreign terms, or light emphasis.
`code` for inline code, model names as strings, config keys.
Code Blocks
Use fenced code blocks with a language identifier for syntax highlighting:
```python
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("gpt2")
```
LaTeX Math
Inline math: $\mathcal{L}(\theta) = -\mathbb{E}[r_\phi(x, y)]$
Display math (on its own line):
$$
\mathcal{L}_{\text{RLHF}}(\theta) = -\mathbb{E}_{(x,y) \sim \pi_\theta}\left[r_\phi(x,y)\right] + \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})
$$
Images
Place images in public/images/blog/ and reference them as:

Always include meaningful alt text.
Links
[Link text](https://example.com)
[Internal link](/publications)
Tables
| Column A | Column B | Column C |
|---|---|---|
| Row 1 | Value | Value |
| Row 2 | Value | Value |
Blockquotes
Use for quotations or callout notes:
> This is a blockquote. Use it for key quotes from papers or notable claims worth highlighting.
4. Style Guidelines
Tone: Write for a technically literate audience (ML researchers, senior engineers) who are smart but busy. Be direct. Avoid hedging every claim with "it is worth noting that…"
Length: 800–2000 words for most posts. Long-form tutorials can go up to 4000 words. Short takes (300–600 words) are fine if the content warrants it.
Audience assumption: Assume the reader knows basic ML. Don't define "gradient descent" or "transformer" from scratch. Do define domain-specific terms (e.g., GRPO, verifiable rewards) on first use.
Avoid:
- Listicles with no analysis ("5 Things About LLMs")
- Hedged non-claims ("it depends")
- Passive voice throughout
- Concluding with "I hope you found this useful"
5. Worked Example
Below is a minimal but complete blog post following this guide:
---
title: "The Difference Between SFT and RLHF (and When It Matters)"
description: "Supervised fine-tuning and RLHF are often conflated. Here is a precise account of what each does, what signal each uses, and when to choose one over the other."
date: 2024-05-20
category: Reinforcement Learning
tags: [rlhf, sft, post-training, alignment]
featured: false
---
Post-training is where the capability of a base model meets the intent of its designers.
Two techniques dominate the landscape: supervised fine-tuning (SFT) and reinforcement
learning from human feedback (RLHF). They are often used together, but they solve
fundamentally different problems.
## What SFT Does
SFT maximises the likelihood of a curated set of (prompt, response) pairs. The signal
is behavioural cloning: show the model the output you want, and train it to reproduce it.
$$
\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\log \pi_\theta(y \mid x)\right]
$$
The limitation is obvious: you need labelled demonstrations, and the model learns to
imitate, not to judge. If your demonstrations are wrong, the model learns to be wrong.
## What RLHF Does
RLHF optimises a policy against a reward model trained on human preference comparisons.
The model is free to explore responses the demonstrators never wrote—as long as they
score well on the reward model.
This introduces a new failure mode: reward hacking. The policy finds outputs that the
reward model scores highly but that humans would not prefer. This is not a bug in RLHF;
it is a property of any optimisation process with a proxy objective.
## When to Use Each
| Scenario | Technique |
|---|---|
| Teaching a new format or style | SFT |
| Aligning to nuanced human preferences | RLHF |
| Cold-starting a new task | SFT → RLHF |
| Verifiable correctness (math, code) | RL with ground-truth reward |
## Key Takeaways
- SFT is behaviour cloning. RLHF is preference optimisation.
- Use SFT to teach format and style; use RLHF to align values and preferences.
- Reward hacking is the central challenge of RLHF, not a peripheral concern.
6. Publishing Checklist
Before saving your post, verify:
- Frontmatter is complete and valid YAML
- `date` is in `YYYY-MM-DD` format
- File is saved as `content/blog/your-post-slug.md`
- All images are in `public/images/` and paths are correct
- Run `npm run generate-search && npm run generate-rss` after saving