# Paper Review Writing Guide & Template
This guide explains how to write and publish a paper review on this site. It covers the required frontmatter fields, recommended section structure, formatting conventions for academic content, and a complete worked example.
## 1. Frontmatter Reference
Every paper review must begin with a YAML frontmatter block. Here is a complete example:
---
title: "Attention Is All You Need"
authors: ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar", "Jakob Uszkoreit", "Llion Jones", "Aidan N. Gomez", "Łukasz Kaiser", "Illia Polosukhin"]
date: 2024-02-10
venue: NeurIPS 2017
rating: 5
tags: [transformers, attention, nlp, architecture]
arxiv: "https://arxiv.org/abs/1706.03762"
---
| Field | Type | Required | Notes |
|---|---|---|---|
| `title` | string | Yes | Exact paper title. Use quotes if it contains special characters. |
| `authors` | string[] | Yes | Full author list as a YAML array. |
| `date` | YYYY-MM-DD | Yes | Date you wrote the review (not the paper's publication date). |
| `venue` | string | Yes | Conference or journal (e.g., NeurIPS 2023, ICML 2024). |
| `rating` | integer 1–5 | Yes | Your personal rating. 5 = must-read, 1 = weak. |
| `tags` | string[] | Yes | 2–6 lowercase tags without spaces. |
| `arxiv` | string URL | No | Link to the arXiv preprint, if available. |
| `pdf` | string URL | No | Link to the PDF, if not on arXiv. |
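The typing rules in this table can be sanity-checked in a few lines of Python. The sketch below operates on an already-parsed frontmatter dict; the `validate` helper is illustrative only and is not part of the site's tooling:

```python
import datetime

# What a YAML parser would produce for the example frontmatter above
frontmatter = {
    "title": "Attention Is All You Need",
    "authors": ["Ashish Vaswani", "Noam Shazeer"],
    "date": datetime.date(2024, 2, 10),
    "venue": "NeurIPS 2017",
    "rating": 5,
    "tags": ["transformers", "attention", "nlp", "architecture"],
}

REQUIRED = {
    "title": str,
    "authors": list,
    "date": datetime.date,
    "venue": str,
    "rating": int,
    "tags": list,
}

def validate(fm: dict) -> list[str]:
    """Return a list of problems; an empty list means the frontmatter passes."""
    errors = [
        f"{field} missing or not of type {expected.__name__}"
        for field, expected in REQUIRED.items()
        if not isinstance(fm.get(field), expected)
    ]
    if isinstance(fm.get("rating"), int) and not 1 <= fm["rating"] <= 5:
        errors.append("rating must be between 1 and 5")
    if isinstance(fm.get("tags"), list) and not 2 <= len(fm["tags"]) <= 6:
        errors.append("expected 2-6 tags")
    return errors

print(validate(frontmatter))  # → []
```

Note that an unquoted `date:` value parses as a date object, not a string, which is why the check expects `datetime.date`.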
## 2. Recommended Section Structure
### Paper Summary
A 2–4 sentence abstract in your own words. What problem does this paper address? What is the key idea? This is not a copy of the abstract—it is your synthesis.
### Key Contributions
A short bullet list of the paper's main technical contributions. Be specific. Avoid vague phrases like "proposes a new method." Say what the method does and why it is novel.
### Methodology
Explain the core technical approach in enough detail that a reader who has not read the paper can understand the key ideas. Use equations where they add precision. Diagrams can be described or linked.
This is where you demonstrate your understanding of the paper. Go beyond what the abstract says.
### Strengths
What does this paper do well? Consider:
- Novelty of the idea
- Strength of the empirical results
- Clarity of the writing and exposition
- Reproducibility and released code/data
### Weaknesses / Limitations
What are the paper's shortcomings? Consider:
- Missing baselines or unfair comparisons
- Narrow experimental scope
- Scalability assumptions that may not hold
- Claims not fully supported by results
This section is important. A review that only praises a paper is not useful. Be constructive, not dismissive.
### Key Takeaways
3–5 bullet points: the things you want to remember from this paper a year from now.
### My Rating
| Rating | Meaning |
|---|---|
| ⭐⭐⭐⭐⭐ | Must-read. Foundational or landmark paper. |
| ⭐⭐⭐⭐ | Highly recommended. Strong contribution. |
| ⭐⭐⭐ | Worth reading if the topic is relevant to you. |
| ⭐⭐ | Limited contribution. Read the abstract and skip. |
| ⭐ | Not recommended. Significant flaws. |
## 3. Formatting Reference
### Math Equations
For inline math, use single dollar signs: $L_{\text{KL}}$
For display equations, use double dollar signs on their own line:
$$
\mathcal{L}(\theta) = -\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\log \pi_\theta(y \mid x)\right]
$$
Always define all symbols the first time you introduce them.
### Figures and Diagrams
If referencing a figure from the paper, describe it in words rather than copying it directly:
Figure 3 in the paper shows that scaling compute by 10× yields only a 1.5× improvement in benchmark performance beyond 10B parameters, suggesting diminishing returns in the over-trained regime.
### Citations
Reference papers inline with author-year format and link to arXiv where possible:
This builds on the work of [Wei et al. (2022)](https://arxiv.org/abs/2201.11903) on chain-of-thought prompting.
### Code Snippets
If the paper introduces an algorithm, you can illustrate it with pseudocode:
```python
# Simplified GRPO policy update (pseudocode)
for step in range(num_steps):
    responses = sample_responses(policy, prompts, G)  # G responses per prompt
    log_probs = policy.log_prob(responses)
    rewards = reward_model(responses)
    advantages = compute_group_relative_advantage(rewards)
    loss = policy_gradient_loss(log_probs, advantages, kl_penalty)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
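The `compute_group_relative_advantage` call is the step most specific to GRPO. A minimal sketch of one common formulation, normalising each reward against its group's mean and standard deviation (the exact normalisation varies across papers, so treat this as illustrative rather than definitive):

```python
import statistics

def compute_group_relative_advantage(rewards: list[float]) -> list[float]:
    """Advantage of each sampled response relative to its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # Every response in the group scored the same: no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

advantages = compute_group_relative_advantage([0.0, 0.5, 1.0])
print(advantages)  # symmetric around zero; the middle response gets advantage 0
```

Because the baseline is computed within each group of responses, no separate value network is needed.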
## 4. Writing Style Guidelines
Audience: Assume the reader is an ML researcher or senior engineer. You do not need to explain what a transformer or attention mechanism is. Do explain paper-specific notations and design choices.
Tone: Analytical, not promotional. The goal is to help readers decide whether to read the paper, not to advertise it. Be honest about weaknesses.
Length: 600–1500 words for a typical review. Landmark papers may warrant 2000+ words.
Academic vs. accessible balance: Use precise technical language, but avoid jargon for its own sake. If you use an acronym, define it on first use. Write as if explaining to a smart colleague who hasn't seen this paper yet.
Avoid:
- Restating the abstract verbatim
- Listing results without interpreting them
- Praising novelty without explaining why something is novel
- Vague criticisms ("the experiments could be more thorough")
## 5. Worked Example
Below is a minimal but complete paper review following this guide:
---
title: "Training Language Models to Follow Instructions with Human Feedback"
authors: ["Long Ouyang", "Jeff Wu", "Xu Jiang", "Diogo Almeida", "Carroll L. Wainwright", "Pamela Mishkin", "Chong Zhang", "Sandhini Agarwal", "Katarina Slama", "Alex Ray", "John Schulman", "Jacob Hilton", "Fraser Kelton", "Luke Miller", "Maddie Simens", "Amanda Askell", "Peter Welinder", "Paul F. Christiano", "Jan Leike", "Ryan Lowe"]
date: 2024-01-15
venue: NeurIPS 2022
rating: 5
tags: [rlhf, alignment, instruct, openai, llm]
arxiv: "https://arxiv.org/abs/2203.02155"
---
## Paper Summary
This paper introduces InstructGPT, the first large-scale demonstration that RLHF (reinforcement
learning from human feedback) substantially improves the alignment of language models with user
intent. The authors fine-tune GPT-3 using SFT on human demonstrations, train a reward model
from human preference comparisons, and then apply PPO to optimise the policy against the reward
model. The resulting model is strongly preferred by human evaluators despite being 100× smaller
than the base GPT-3 model it was compared against.
## Key Contributions
- End-to-end pipeline for RLHF applied to a production-scale LLM (1.3B–175B parameters)
- Empirical evidence that RLHF alignment does not significantly hurt performance on NLP benchmarks (the "alignment tax")
- Demonstration that a 1.3B InstructGPT model is preferred to GPT-3 175B by human raters
- Public release of the preference dataset and methodology details
## Methodology
The training pipeline has three stages:
**Stage 1 — Supervised Fine-Tuning (SFT):**
Human labellers write ideal responses to a sampled set of prompts. The base GPT-3 model
is fine-tuned on these demonstrations.
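Schematically, Stage 1 is ordinary maximum-likelihood training: the loss is the mean negative log-likelihood of the demonstration tokens under the model. A sketch of the objective (not the paper's code):

```python
def sft_loss(token_logprobs: list[float]) -> float:
    """Mean negative log-likelihood of a human demonstration under the model."""
    return -sum(token_logprobs) / len(token_logprobs)

# Higher likelihood on the demonstration (log-probs closer to 0) means lower loss
print(sft_loss([-0.1, -0.2, -0.3]) < sft_loss([-1.0, -2.0, -3.0]))  # → True
```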
**Stage 2 — Reward Model Training:**
For each prompt, the SFT model generates multiple candidate responses. Labellers rank
them by quality. A reward model $r_\phi(x, y)$ is trained to predict the preferred response
using a pairwise comparison loss:

$$
\mathcal{L}(\phi) = -\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]
$$

where $\sigma$ is the logistic sigmoid, $y_w$ is the preferred response, and $y_l$ is the less preferred response.
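Numerically, this loss simply rewards a large margin between the two scores. A scalar sketch (schematic only; in the paper the reward model is a fine-tuned language model, not a scalar function):

```python
import math

def pairwise_comparison_loss(score_preferred: float, score_rejected: float) -> float:
    """-log sigmoid(r(x, y_w) - r(x, y_l)): small when the preferred response scores higher."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A wider margin in favour of the preferred response gives a lower loss
print(pairwise_comparison_loss(2.0, 0.0) < pairwise_comparison_loss(0.5, 0.0))  # → True
```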
**Stage 3 — PPO Fine-Tuning:**
The SFT model is further fine-tuned using PPO to maximise the reward model score,
with a KL penalty to prevent the policy from diverging too far from the SFT initialisation:
$$
\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\left[r_\phi(x, y)\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{SFT}}(y \mid x)\right]
$$
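The per-sample quantity being maximised can be sketched as reward minus a scaled log-ratio penalty. This uses a single-sample KL estimate and a hypothetical value of the coefficient β; it illustrates the shape of the objective, not the paper's implementation:

```python
def rlhf_objective(reward: float, logp_policy: float, logp_sft: float, beta: float = 0.1) -> float:
    """Reward minus a KL-style penalty for drifting away from the SFT initialisation."""
    log_ratio = logp_policy - logp_sft  # single-sample estimate of log(pi_theta / pi_sft)
    return reward - beta * log_ratio

# Assigning a response much higher probability than the SFT model did
# (a larger log-ratio) costs a larger penalty
print(rlhf_objective(1.0, -1.0, -2.5) < rlhf_objective(1.0, -2.4, -2.5))  # → True
```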
## Strengths
- **Impressive scale:** Demonstrates RLHF at the scale of GPT-3, which required significant engineering investment. This is not a toy experiment.
- **Honest evaluation:** The authors measure the "alignment tax" on standard NLP benchmarks and find it is modest—an important result for practitioners.
- **Human preference data at scale:** The dataset and methodology are described in enough detail to be reproduced.
## Weaknesses / Limitations
- **Reward model generalisation:** The reward model is trained on a narrow distribution of prompts from the OpenAI API. Its preferences may not generalise to out-of-distribution inputs.
- **Labeller subjectivity:** The paper acknowledges that labeller preferences encode specific values and demographics. This is not addressed beyond acknowledgment.
- **PPO instability:** The paper notes that PPO training is unstable and required significant hyperparameter tuning, but does not publish full details.
## Key Takeaways
- RLHF with a relatively small number of human comparisons (~50K) can dramatically improve instruction-following.
- A smaller model aligned with RLHF outperforms a much larger unaligned model on human preference evaluations.
- The alignment tax on NLP benchmarks is real but modest—a key result for practitioners.
- Reward hacking and reward model generalisation remain open problems.
- This paper established the RLHF pipeline that became standard across the industry.
## My Rating
⭐⭐⭐⭐⭐ — Foundational paper. Required reading for anyone working on post-training or alignment.
## 6. Publishing Checklist
Before saving your review, verify:
- Frontmatter is complete and valid YAML
- `authors` is a YAML array (not a comma-separated string)
- `rating` is an integer between 1 and 5
- File is saved as `content/paper-reviews/paper-title-slug.md`
- All equations compile correctly (preview with `npm run dev`)
- Run `npm run generate-search && npm run generate-rss` after saving