Advice on writing a scientific paper (in academia)

I recommend reading Strunk and White before writing anything: it's very concise and has good general guidelines. It's online here. I recommend reading Politics and the English Language before writing to make a point: it's a good argument for clarity and simplicity, drawing attention to common silly things people do to make their writing sound impressive. It's online here and a few of my favourite bits are here.

Use a question-driven approach

Start by getting as clear as you can in your head about what question your work addresses. Anecdotally, I find a fairly common structure in talks (less so papers) by junior scientists is

An introduction to the general area of study,
What you did, in great detail (before you learn from experience that while this shows how much you had to struggle, everyone struggles at the start of something new, and these details can usually be summarised or omitted with no loss of your audience's understanding),
What the result of doing that was,
Speculation as to why that might be of interest to the general area: which hypotheses the result (dis)favours.

That’s not a compelling structure, because the audience can't tell, as they follow along, why you’re doing what you’re doing. Finish your introduction with the specific question you attempted to answer, explaining why that question matters. This means that people have understood where you're trying to get to as they then follow your explanation for how you tried to get there, which is much clearer and more satisfying. It means people are either hooked from your introduction onwards, or they realise as early as possible that what you’ve done isn’t relevant for them and they can save their mental bandwidth for something else. There are too many papers and too many talks: help people to know if yours is right for them. I suspect that the uncompelling structure above arises because the author is writing too narrowly from their own perspective (which is also responsible for failure to explain things that are very familiar to the author). An important point beyond the use of a question-driven approach is to identify your audience and write your paper bearing in mind their perspective.

The question you present is not necessarily the same question that you actually set out to answer at the start of your project: there's nothing wrong with finding some result and then realising afterwards that it answers someone else’s important question. And you might want to revisit the question you started with depending on what you found, e.g. restrict the scope if you realise there are problems with generalisability.

I feel (I am perhaps biased) that an exception to this question-focussed approach is when you discover a thing that exists (or existed) that wasn’t known to exist, e.g. a viral variant with certain properties, or a volcano, or an ancient language. There’s not necessarily an obvious question preceding such findings. One could try to shoehorn any finding into this exception, e.g. “I discovered that there exists a statistically significant correlation between variables x and y in my data,” but the fact that you could do something doesn't mean you should. Generally, people are interested in what you find in your data only to the extent that we can generalise from it to something bigger, which means considering, for example, whether the data you started with was a good thing to start with for the question in hand. So it's better not to shoehorn but to challenge oneself to construct a question to which one’s findings provide a reasonable answer (such as “does actionable thing X cause desirable outcome Y?”).

I've heard people say that scientific work should always be testing some hypothesis. I heard one reply providing the counter-example of counting species in a given area to record biodiversity: there is no hypothesis being tested. Both sides here have a point. If you fish around in your data for anything interesting and then later construct some explanation for what you found, you're at risk of misinterpreting random chance in the data, worsening the reproducibility crisis. However, reducing science to the 'testing' of a hypothesis is overly reductionist $^\dagger$. Scientific work should aim to discriminate between different hypotheses, and these may form a continuous space (such as the possible values of the spatial density of species) rather than just two possibilities. This is similar to saying we should start from a question: we should be investigating how the evidence discriminates between different possible answers.
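The risk of fishing can be illustrated with a small simulation. This is a minimal sketch, not a recommended analysis: the function names, the sample sizes and the use of the Fisher z approximation for the p-value are all my own illustrative choices. It tests many pure-noise variables for correlation with a pure-noise outcome and counts how many pass p < 0.05 by chance alone.

```python
import math
import random
from statistics import NormalDist


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_p_value(xs, ys):
    """Approximate two-sided p-value for zero correlation (Fisher z transform)."""
    z = math.atanh(pearson_r(xs, ys)) * math.sqrt(len(xs) - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_spurious_hits(n_variables=1000, n_samples=50, alpha=0.05, seed=1):
    """Count noise variables that look 'significantly' correlated with a noise outcome."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    hits = 0
    for _ in range(n_variables):
        # Each candidate variable is independent noise: any "finding" is chance.
        candidate = [rng.gauss(0, 1) for _ in range(n_samples)]
        if correlation_p_value(candidate, outcome) < alpha:
            hits += 1
    return hits


if __name__ == "__main__":
    print(count_spurious_hits(), "of 1000 unrelated variables passed p < 0.05")
```

On average roughly alpha × n_variables of the candidates (about 50 here) will pass the threshold, even though none of them is related to the outcome. Each one could be written up with a plausible-sounding explanation after the fact.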

Structure your writing at the level of paragraphs

Plan what you are going to write, and make sure you understand what you have already written, at the level of paragraphs. Each paragraph should

collect sentences that are related to each other with an easily definable theme
avoid repeating something from a previous paragraph (unless this is intentional for emphasis, especially between Discussion and Results),
come in a coherent flow of ideas from one paragraph to the next.

To ensure that your writing is structured at the level of paragraphs, maintain a list of very short bullet points saying what the main point of each paragraph is. You'll need to keep that list in sync with the actual paragraphs as you write them - switching between modifying the structure and modifying the actual text until the two things match each other and the plan matches the structure you want. Depending on what mood you're currently in, it can be better to plan your structure (big-picture thinking) or craft text to match your desired structure (fleshing out details). Keep that plan visible when you share your draft with co-authors: it's helpful for them to see at a glance what your structure is supposed to be, instead of them each needing to figure it out. If you start the writing process by creating a draft of this list, that’s a great point at which to get early feedback from co-authors, because the whole point and structure of the paper is clear at a glance and can easily be turned into something different with minimal wasted writing.

As an example of this approach, I might start by writing this paragraph structure for my Discussion section:

summary: what we did, what we found, how we interpret that finding
refer to previous studies
limitation: non-representative data limits generalisability
limitation: demonstrated correlation not causation
outlook, what should (some) people do now in light of this result

I might show this to a co-author who can tell me immediately that our data is much more representative than I thought, so there's no need to write paragraph 3 (saving my time writing it and her time reading it). Then I might start to write the actual paragraphs and realise there's so much to say on interpretation that it deserves a whole paragraph, in which we should also talk about novelty, because interpretation and novelty both help people understand why they should care about this result. So I'd revise my structure to

summary of what we did and what we found
why we care: how we interpret that finding and why it's novel
refer to previous studies
limitation: correlation not causation
outlook, what should (some) people do now in light of this result

(Variations on the above structures generally work well for the Discussion section of a results-focussed paper; adjustment is needed for a methods-focussed paper.)

Another example: I might take a section of my paper that I already wrote without an explicit structure accompanying it, and then try to write such a list to match what I've already written. As a result I might identify some problems with my existing structure, e.g. one paragraph without an easily identifiable theme, or one point repeated in multiple paragraphs when those sentences would be better collected together. I would then move around sentences (or delete them) until the text has a sensible paragraph structure.

Having an explicit written list for what the paragraph structure is will help you and your co-authors write text with good structure. Of course you won't include that list with the actual final text, and so your broad readership won't see quite so easily what the paragraph structure is. To help them, try to make the opening sentence of each paragraph indicate what the rest of the paragraph will be about.

The Introduction section should have a rough funnel structure from broad to specific, ending with an explanation of what you set out to do in this paper: “Here, we...”. I read somewhere, and agree, that the funnel should be such that the closing bit about your work should feel inevitable - you’ve drawn the reader towards your question so that it feels like the natural thing to work on. In the best case you have provided something completely novel, in which case immediately before your “Here, we...” is an explanation of how no one has looked at this problem before. Normally your result is more incremental, and you can say that uncertainty remains on this particular aspect of a phenomenon, or that the problem is so important that further studies are useful to validate existing results. An example funnel:

one paragraph on the COVID-19 pandemic in England and Wales
one paragraph on contact tracing as an approach to reduce the spread of COVID-19
one paragraph on digital contact tracing (using proximity-detecting smartphone apps)
one paragraph on previous studies of digital contact tracing, mentioning the absence of and the importance of empirical estimates of epidemiological impact
"Here, we empirically estimated the epidemiological impact of digital contact tracing in England and Wales. We do this by... [very brief summary of data and method]"

Other considerations

CLARITY CLARITY CLARITY. Write to be understood as clearly as possible. Do not use your writing style to try to make the work sound more impressive. If impressing people is your aim, have something impressive to say (in which case being as clear as possible helps your aim); if you don't, using an impressive-sounding style makes you a salesperson for a crappy car - it's not a good look. Use methods and analyse data: don't leverage them or harness them. A methodology is a class of methods - are you sure that's what you mean rather than a method? A single method can still have aspects of it that can be adjusted in different applications; this doesn't mean we should promote it to being a class of methods. Methodologies can almost always be replaced by methods.

Writing the Methods section is the easiest thing to do: it’s just describing the relevant parts of what you did in a way that others can understand. I’ve heard it recommended as a starting point for that reason - start with something easy to get your foot in the door of the writing process. However, grappling with the construction of a narrative for your work may make you realise that part of what you’ve done is unnecessary, or that something you haven’t done is necessary, or that it would be better to split into several papers. So I think it’s better to work on the big picture before diving into easy details.

What I look for in every presentation or paper is what people should do differently as a result of what you’ve found. For a methods paper, it’s normally to use that method instead of the best previous alternative for this particular application. For applied results papers, it might be that doctors/society should consider using more X to get more of desirable outcome Y, or at least that further study should establish your X-Y link more concretely and explore feasibility, cost-effectiveness etc. For theoretical results papers, it might be that other researchers should investigate your new idea because it has the potential to help us understand something which is manifestly important. For a convincing negative result - enough data for statistical power, appropriate kind of data & method for accuracy and generalisability - the implication might be that people should stop pursuing an avenue previously thought to be promising. There are lots of possibilities for how your finding might influence some people, but if you struggle to think concretely of who and how, it’s worth having a frank conversation with yourself about whether your paper was a good use of your time. No use crying over spilt milk of course, but one can strive to not keep spilling milk forever. (This was a big factor in why I stopped working on theoretical particle physics.)

Clearly establish and then stick to a one-to-one mapping between [things] and [the names we use to refer to those things]. That is, avoid using different terms to refer to the same thing, and avoid using the same term to mean different things. A one-to-one map makes it as clear as possible for the reader and frees up their concentration for more important things. Expanding on this, repetition - in the sense of repeating your decision about how to write a particular thing, not repeating the same point - helps the reader to see the connections between different places in your writing. This goes for structure as well as choice of names (see Strunk and White rule 19: Express coordinate ideas in similar form). Compare:
Authors A used method a and found result X [A]. Result Y was obtained [b] leveraging methodology b. Employing c, C et al [C] demonstrated Z.
with something like
A et al used method a and found X [A]. B et al used method b and found Y [B]. C et al used method c and found Z [C].
The latter is certainly drier but easier to parse. Recall that coercing data into tidy format (rectangular, one row per observation, one column per aspect of the observation i.e. per variable) makes the data clearer as well as easier to operate on (though obviously some data is naturally of a different structure, such as trees). In the same way, coercing two or more similar sentences (or clauses within a sentence) into a common structure helps to clarify both their similarities and their differences for lower mental effort.
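For readers unfamiliar with tidy format, here is a minimal sketch of the coercion described above. The patient and visit names, the values and the helper `to_tidy` are all hypothetical, chosen purely for illustration.

```python
# Hypothetical "wide" table: one row per patient, one column per visit.
wide = [
    {"patient": "p1", "visit_1": 5.1, "visit_2": 4.8},
    {"patient": "p2", "visit_1": 6.0, "visit_2": 6.3},
]


def to_tidy(rows, id_key, variable_name, value_name):
    """Return one row per observation, with one column per variable."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue
            tidy.append({id_key: row[id_key], variable_name: key, value_name: value})
    return tidy


tidy = to_tidy(wide, id_key="patient", variable_name="visit", value_name="score")
for observation in tidy:
    print(observation)
```

Every row of the result has the same three fields, so any two observations can be compared at a glance. That is the same benefit that a common sentence structure gives the reader.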

Tenses: I like past tense for Methods and Results, present (or future) tense for Discussion. Paraphrasing an explanation I was given: mathematicians and physicists favour the present tense for results because there is a feel of timelessness to the findings. However, in biomedical research, different findings are fairly often in some tension with each other; a single finding is just a higher level of a single observation within a study.
Methods: We regressed y against x.
Results: We found a correlation.
Discussion: We interpret this as evidence that x causes y, given the assumed absence of confounders.
And use the active voice for what you did (We did X); recommendations to use the passive voice (X was done) to sound more neutral are so last century. I read somewhere that the active voice reminds the reader that the steps taken were made by fallible humans, not by some idealised notion of the scientific process. I agree. There can also be genuine ambiguity about who did the thing that was done if you're also talking about previous work.

Report results to the appropriate level of precision. For numerical values, that's a choice for the number of significant figures. (If you're unfamiliar with the concept, you should google it, though a basic example may suffice: 36, 36000, 3.6 x 10^5, and 0.00036 are all quoted to 2 significant figures.) Two factors are relevant: the precision with which you know the value, and the maximum precision that any of your readers might care about (remembering that the most interested readers will want to dive into the underlying data and not just copy-paste your derived values). You should use the smaller of these two. A first example: if my friend asks me when I'm going on holiday, I would not reply 09:39 on August 3rd even though I know my departure time to that level of precision: that would be more precision than they care about. A second example: if I observe 100 successes out of 300 trials, it would be silly to report the success probability as 0.33333333, not only because no-one cares about the later digits but because I am extremely uncertain what their values should be. For simplicity let us say I report frequentist confidence intervals as a quantification of uncertainty (a common abuse of the concept of uncertainty), obtaining the interval 0.2808136 - 0.3901981 from running prop.test(100, 300) in R. This range is adequately summarised as 0.28 - 0.39, because a third significant figure would change each value by an amount that is negligibly small compared to the size of the range itself. Then for consistency we should quote our central estimate to two significant figures too: we report our estimate of the success probability as 0.33 (95% CI: 0.28 - 0.39). If reporting p-values (read elsewhere, e.g. here, for issues with that), one significant figure is usually best, occasionally two, never more. No-one's conclusion will be different if you report p = 0.06 instead of p = 0.0589.
An exception to these considerations is tables of numerical values: where these are automatically generated and contain many values, it would take a lot of work (and perhaps look a little untidy) to adjust the number of significant figures for every cell in the table independently. In this case it is forgivable for the author to make one choice of the most appropriate number of significant figures (or decimal places) and apply this to the whole table (or to each row or to each column, where these have values with very different sizes).
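The second example above can be reproduced with a short sketch. The helper names `round_sig` and `wilson_cc_interval` are my own. The interval formula is the Wilson score interval with continuity correction, which to my understanding is what R's prop.test reports by default; treat that correspondence as an assumption to check against the R documentation.

```python
import math
from statistics import NormalDist


def round_sig(x, sig):
    """Round x to `sig` significant figures."""
    if x == 0:
        return 0.0
    return round(x, sig - 1 - math.floor(math.log10(abs(x))))


def wilson_cc_interval(successes, trials, level=0.95):
    """Wilson score interval with continuity correction for a proportion."""
    n = trials
    p = successes / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    z22n = z ** 2 / (2 * n)
    p_low = p - 0.5 / n   # continuity-corrected estimate for the lower bound
    p_high = p + 0.5 / n  # continuity-corrected estimate for the upper bound
    lower = (p_low + z22n
             - z * math.sqrt(p_low * (1 - p_low) / n + z22n / (2 * n))) / (1 + 2 * z22n)
    upper = (p_high + z22n
             + z * math.sqrt(p_high * (1 - p_high) / n + z22n / (2 * n))) / (1 + 2 * z22n)
    return max(0.0, lower), min(1.0, upper)


lower, upper = wilson_cc_interval(100, 300)
report = (f"{round_sig(100 / 300, 2)} "
          f"(95% CI: {round_sig(lower, 2)} - {round_sig(upper, 2)})")
print(report)
```

The same helper covers the p-value advice: `round_sig(0.0589, 1)` gives 0.06.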

Finally, a point of grammar which comes up all the time in technical writing: hyphenation of modifiers. When we write A B C where C is a noun and A and B are modifiers, the rule is to hyphenate A-B if and only if A and B work together to form a single composite modifier (the most common example being that A modifies B), and not to hyphenate if A and B separately modify C. For example, a strange smelling cheese is a cheese that is strange and possesses the ability to smell, whereas a strange-smelling cheese is a cheese that smells strange. I like the recommendation of the Chicago Manual of Style to break this rule when A is an adverb that ends in "ly", because ending in "ly" means it's obviously an adverb and adverbs only modify adjectives or verbs, never nouns, so there's no ambiguity about what A is modifying. For example, a "significantly improved method" is just as clear as, and more stylish than, a "significantly-improved method". Hyphenation is for when the modifiers precede the noun, not when they follow it: that cheese is strange smelling, not strange-smelling.

$^\dagger$ I suspect that focussing too narrowly on 'testing' a hypothesis probably comes from a frequentist statistics tradition of rejecting or failing to reject a null hypothesis. As Andrew Gelman pithily summarises: "I do model checking to test the model that I am fitting, usually not to test a straw-man null hypothesis. I already know my model is false, so I don’t pat myself on the back for finding problems with the fit (thus "rejecting" the model); rather, when I find problems with fit, this motivates improvement to the model." We should be investigating how the evidence discriminates between different possible answers, and knowing that one answer provides a poor explanation of the data does not mean we should reject it: logically, we must also show that at least one other plausible answer provides a good explanation of the data.
