# Paper Review Writing Guide & Template
This guide explains how to write and publish a paper review on this site. It covers the required frontmatter fields, recommended section structure, formatting conventions for academic content, and a complete worked example.
## 1. Frontmatter Reference
Every paper review must begin with a YAML frontmatter block. Here is a complete example:
---
title: "Attention Is All You Need"
authors: ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar", "Jakob Uszkoreit", "Llion Jones", "Aidan N. Gomez", "Łukasz Kaiser", "Illia Polosukhin"]
date: 2024-02-10
venue: NeurIPS 2017
rating: 5
tags: [transformers, attention, nlp, architecture]
arxiv: "https://arxiv.org/abs/1706.03762"
---
| Field | Type | Required | Notes |
|---|---|---|---|
| `title` | string | Yes | Exact paper title. Use quotes if it contains special characters. |
| `authors` | string[] | Yes | Full author list as a YAML array. |
| `date` | YYYY-MM-DD | Yes | Date you wrote the review (not the paper's publication date). |
| `venue` | string | Yes | Conference or journal (e.g., NeurIPS 2023, ICML 2024). |
| `rating` | integer 1–5 | Yes | Your personal rating. 5 = must-read, 1 = weak. |
| `tags` | string[] | Yes | 2–6 lowercase tags without spaces. |
| `arxiv` | string URL | No | Link to the arXiv preprint, if available. |
| `pdf` | string URL | No | Link to the PDF, if not on arXiv. |
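The typing rules in this table can be sanity-checked in a few lines of Python. The sketch below operates on an already-parsed frontmatter dict; the `validate` helper is illustrative only and is not part of the site's tooling:

```python
import datetime

# What a YAML parser would produce for the example frontmatter above
frontmatter = {
    "title": "Attention Is All You Need",
    "authors": ["Ashish Vaswani", "Noam Shazeer"],
    "date": datetime.date(2024, 2, 10),
    "venue": "NeurIPS 2017",
    "rating": 5,
    "tags": ["transformers", "attention", "nlp", "architecture"],
}

REQUIRED = {
    "title": str,
    "authors": list,
    "date": datetime.date,
    "venue": str,
    "rating": int,
    "tags": list,
}

def validate(fm: dict) -> list[str]:
    """Return a list of problems; an empty list means the frontmatter passes."""
    errors = [
        f"{field} missing or not of type {expected.__name__}"
        for field, expected in REQUIRED.items()
        if not isinstance(fm.get(field), expected)
    ]
    if isinstance(fm.get("rating"), int) and not 1 <= fm["rating"] <= 5:
        errors.append("rating must be between 1 and 5")
    if isinstance(fm.get("tags"), list) and not 2 <= len(fm["tags"]) <= 6:
        errors.append("expected 2-6 tags")
    return errors

print(validate(frontmatter))  # → []
```

Note that an unquoted `date:` value parses as a date object, not a string, which is why the check expects `datetime.date`.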
## 2. Recommended Section Structure
### Paper Summary
A 2–4 sentence abstract in your own words. What problem does this paper address? What is the key idea? This is not a copy of the abstract—it is your synthesis.
### Key Contributions
A short bullet list of the paper's main technical contributions. Be specific. Avoid vague phrases like "proposes a new method." Say what the method does and why it is novel.
### Methodology
Explain the core technical approach in enough detail that a reader who has not read the paper can understand the key ideas. Use equations where they add precision. Diagrams can be described or linked.
This is where you demonstrate your understanding of the paper. Go beyond what the abstract says.
### Strengths
What does this paper do well? Consider:
- Novelty of the idea
- Strength of the empirical results
- Clarity of the writing and exposition
- Reproducibility and released code/data
### Weaknesses / Limitations
What are the paper's shortcomings? Consider:
- Missing baselines or unfair comparisons
- Narrow experimental scope
- Scalability assumptions that may not hold
- Claims not fully supported by results
This section is important. A review that only praises a paper is not useful. Be constructive, not dismissive.
### Key Takeaways
3–5 bullet points: the things you want to remember from this paper a year from now.
### My Rating
| Rating | Meaning |
|---|---|
| ⭐⭐⭐⭐⭐ | Must-read. Foundational or landmark paper. |
| ⭐⭐⭐⭐ | Highly recommended. Strong contribution. |
| ⭐⭐⭐ | Worth reading if the topic is relevant to you. |
| ⭐⭐ | Limited contribution. Read the abstract and skip. |
| ⭐ | Not recommended. Significant flaws. |
## 3. Formatting Reference
### Math Equations
For inline math, use single dollar signs: $L_{\text{KL}}$
For display equations, use double dollar signs on their own line:
$$
\mathcal{L}(\theta) = -\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\log \pi_\theta(y \mid x)\right]
$$
Always define all symbols the first time you introduce them.
### Figures and Diagrams
If referencing a figure from the paper, describe it in words rather than copying it directly:
Figure 3 in the paper shows that scaling compute by 10× yields only a 1.5× improvement in benchmark performance beyond 10B parameters, suggesting diminishing returns in the over-trained regime.
### Citations
Reference papers inline with author-year format and link to arXiv where possible:
This builds on the work of [Wei et al. (2022)](https://arxiv.org/abs/2201.11903) on chain-of-thought prompting.
### Code Snippets
If the paper introduces an algorithm, you can illustrate it with pseudocode:
```python
# Simplified GRPO policy update (pseudocode)
for step in range(num_steps):
    responses = sample_responses(policy, prompts, G)  # G responses per prompt
    log_probs = policy.log_prob(responses)
    rewards = reward_model(responses)
    advantages = compute_group_relative_advantage(rewards)
    loss = policy_gradient_loss(log_probs, advantages, kl_penalty)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
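The `compute_group_relative_advantage` call is the step most specific to GRPO. A minimal sketch of one common formulation, normalising each reward against its group's mean and standard deviation (the exact normalisation varies across papers, so treat this as illustrative rather than definitive):

```python
import statistics

def compute_group_relative_advantage(rewards: list[float]) -> list[float]:
    """Advantage of each sampled response relative to its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # Every response in the group scored the same: no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

advantages = compute_group_relative_advantage([0.0, 0.5, 1.0])
print(advantages)  # symmetric around zero; the middle response gets advantage 0
```

Because the baseline is computed within each group of responses, no separate value network is needed.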
## 4. Writing Style Guidelines
Audience: Assume the reader is an ML researcher or senior engineer. You do not need to explain what a transformer or attention mechanism is. Do explain paper-specific notations and design choices.
Tone: Analytical, not promotional. The goal is to help readers decide whether to read the paper, not to advertise it. Be honest about weaknesses.
Length: 600–1500 words for a typical review. Landmark papers may warrant 2000+ words.
Academic vs. accessible balance: Use precise technical language, but avoid jargon for its own sake. If you use an acronym, define it on first use. Write as if explaining to a smart colleague who hasn't seen this paper yet.
Avoid:
- Restating the abstract verbatim
- Listing results without interpreting them
- Praising novelty without explaining why something is novel
- Vague criticisms ("the experiments could be more thorough")
## 5. Worked Example
Below is a minimal but complete paper review following this guide:
---
title: "Training Language Models to Follow Instructions with Human Feedback"
authors: ["Long Ouyang", "Jeff Wu", "Xu Jiang", "Diogo Almeida", "Carroll L. Wainwright", "Pamela Mishkin", "Chong Zhang", "Sandhini Agarwal", "Katarina Slama", "Alex Ray", "John Schulman", "Jacob Hilton", "Fraser Kelton", "Luke Miller", "Maddie Simens", "Amanda Askell", "Peter Welinder", "Paul F. Christiano", "Jan Leike", "Ryan Lowe"]
date: 2024-01-15
venue: NeurIPS 2022
rating: 5
tags: [rlhf, alignment, instruct, openai, llm]
arxiv: "https://arxiv.org/abs/2203.02155"
---
## Paper Summary
This paper introduces InstructGPT, the first large-scale demonstration that RLHF (reinforcement
learning from human feedback) substantially improves the alignment of language models with user
intent. The authors fine-tune GPT-3 using SFT on human demonstrations, train a reward model
from human preference comparisons, and then apply PPO to optimise the policy against the reward
model. The resulting model is strongly preferred by human evaluators despite being 100× smaller
than the base GPT-3 model it was compared against.
## Key Contributions
- End-to-end pipeline for RLHF applied to a production-scale LLM (1.3B–175B parameters)
- Empirical evidence that RLHF alignment does not significantly hurt performance on NLP benchmarks (the "alignment tax")
- Demonstration that a 1.3B InstructGPT model is preferred to GPT-3 175B by human raters
- Public release of the preference dataset and methodology details
## Methodology
The training pipeline has three stages:
**Stage 1 — Supervised Fine-Tuning (SFT):**
Human labellers write ideal responses to a sampled set of prompts. The base GPT-3 model
is fine-tuned on these demonstrations.
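Schematically, Stage 1 is ordinary maximum-likelihood training: the loss is the mean negative log-likelihood of the demonstration tokens under the model. A sketch of the objective (not the paper's code):

```python
def sft_loss(token_logprobs: list[float]) -> float:
    """Mean negative log-likelihood of a human demonstration under the model."""
    return -sum(token_logprobs) / len(token_logprobs)

# Higher likelihood on the demonstration (log-probs closer to 0) means lower loss
print(sft_loss([-0.1, -0.2, -0.3]) < sft_loss([-1.0, -2.0, -3.0]))  # → True
```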
**Stage 2 — Reward Model Training:**
For each prompt, the SFT model generates multiple candidate responses. Labellers rank
them by quality. A reward model $r_\phi(x, y)$ is trained to predict the preferred response
using a pairwise comparison loss:

$$
\mathcal{L}(\phi) = -\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]
$$

where $\sigma$ is the logistic sigmoid, $y_w$ is the preferred response, and $y_l$ is the less preferred response.
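Numerically, this loss simply rewards a large margin between the two scores. A scalar sketch (schematic only; in the paper the reward model is a fine-tuned language model, not a scalar function):

```python
import math

def pairwise_comparison_loss(score_preferred: float, score_rejected: float) -> float:
    """-log sigmoid(r(x, y_w) - r(x, y_l)): small when the preferred response scores higher."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A wider margin in favour of the preferred response gives a lower loss
print(pairwise_comparison_loss(2.0, 0.0) < pairwise_comparison_loss(0.5, 0.0))  # → True
```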
**Stage 3 — PPO Fine-Tuning:**
The SFT model is further fine-tuned using PPO to maximise the reward model score,
with a KL penalty to prevent the policy from diverging too far from the SFT initialisation:
$$
\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\left[r_\phi(x, y)\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{SFT}}(y \mid x)\right]
$$
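The per-sample quantity being maximised can be sketched as reward minus a scaled log-ratio penalty. This uses a single-sample KL estimate and a hypothetical value of the coefficient β; it illustrates the shape of the objective, not the paper's implementation:

```python
def rlhf_objective(reward: float, logp_policy: float, logp_sft: float, beta: float = 0.1) -> float:
    """Reward minus a KL-style penalty for drifting away from the SFT initialisation."""
    log_ratio = logp_policy - logp_sft  # single-sample estimate of log(pi_theta / pi_sft)
    return reward - beta * log_ratio

# Assigning a response much higher probability than the SFT model did
# (a larger log-ratio) costs a larger penalty
print(rlhf_objective(1.0, -1.0, -2.5) < rlhf_objective(1.0, -2.4, -2.5))  # → True
```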
## Strengths
- **Impressive scale:** Demonstrates RLHF at the scale of GPT-3, which required significant engineering investment. This is not a toy experiment.
- **Honest evaluation:** The authors measure the "alignment tax" on standard NLP benchmarks and find it is modest—an important result for practitioners.
- **Human preference data at scale:** The dataset and methodology are described in enough detail to be reproduced.
## Weaknesses / Limitations
- **Reward model generalisation:** The reward model is trained on a narrow distribution of prompts from the OpenAI API. Its preferences may not generalise to out-of-distribution inputs.
- **Labeller subjectivity:** The paper acknowledges that labeller preferences encode specific values and demographics. This is not addressed beyond acknowledgment.
- **PPO instability:** The paper notes that PPO training is unstable and required significant hyperparameter tuning, but does not publish full details.
## Key Takeaways
- RLHF with a relatively small number of human comparisons (~50K) can dramatically improve instruction-following.
- A smaller model aligned with RLHF outperforms a much larger unaligned model on human preference evaluations.
- The alignment tax on NLP benchmarks is real but modest—a key result for practitioners.
- Reward hacking and reward model generalisation remain open problems.
- This paper established the RLHF pipeline that became standard across the industry.
## My Rating
⭐⭐⭐⭐⭐ — Foundational paper. Required reading for anyone working on post-training or alignment.
## 6. Publishing Checklist
Before saving your review, verify:
- Frontmatter is complete and valid YAML
- `authors` is a YAML array (not a comma-separated string)
- `rating` is an integer between 1 and 5
- File is saved as `content/paper-reviews/paper-title-slug.md`
- All equations compile correctly (preview with `npm run dev`)
- Run `npm run generate-search && npm run generate-rss` after saving