IPEd

News – good, bad, fake? What corpora can tell us about language usage, with Claudia Koch-McQuillan
Transcribed and edited by Dr Pam Faulks AE and Dr Catherine Heath AE

At the recent meeting of Editors NSW, Claudia Koch-McQuillan introduced us to the world of corpora – large collections of language as it is used, marked up to support a wide range of searches – which are a relatively recent addition to the toolbox of anybody working with language. This talk showed how corpora can be used to check collocations, linguistic variations and concordances as well as to quickly and easily extract specialist terminology from a text or collection of texts to build up technical vocabulary. Corpora are a very useful addition to any language toolbox, and the resources used in this talk are freely available online.

Claudia Koch-McQuillan is a translator, interpreter and a former university lecturer and tutor in translation studies and technologies. (She once translated Angela Merkel for Barack Obama.) She finds language endlessly fascinating and enjoys corpora as tools that give unique insights into our practical use of language and which can provide solutions to otherwise intractable language problems.

Corpora are excellent tools to have in your toolbox for certain types of searches where you could not really use a dictionary, and so I hope you will find this talk on how to use corpora interesting and maybe even intriguing.

What are corpora, or what is a corpus?

Corpora are [database] collections of authentic natural texts, so they are not texts that were written specifically for a corpus. They can be newspaper texts; they can be TV show scripts – there is even a corpus of American soap operas; they can be academic English; they can be spoken texts. The texts are just collected in order to document and research language in use. Corpora are a fairly recent development because they are very data-intensive; so until about 15 or 20 years ago the computers were just not powerful enough to process vast amounts of data, whereas these days you find corpora with millions and millions of words and any searches are really fast.

Corpora, as I said, can be written and spoken texts. They are available in a number of languages; I think these days you would find corpora for most major languages. They can be in specific registers or genres – so novels, journalism, academic English. There are also, for instance, corpora for American English, British English, Australian English, African English, Jamaican English – if that is something of interest to you. Not all of them are available for free, but the ones I am talking about today are free ones. Corpora can also reflect a certain period of time, so, again, that could be of interest if you want to know if a particular expression was used at a certain point in time, or when it came into being.

Now, one thing that makes corpora so useful is that the texts in them are tagged by parts of speech – and, again, that is something that is very data-intensive, so that is why you need powerful computers. Every single word in these millions of words in, for instance, the corpus of American English is marked up with metadata saying ‘this is an adjective; this is a verb’ – even ‘this is a verb in infinitive form’ or ‘this is a verb in gerund form’, ‘this is a singular noun or a plural noun’, ‘this is a definite or an indefinite article’, and so on, and so forth. So, this meta-information allows you to do very specific searches for how language is used.
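
As a rough illustration of what part-of-speech tagging makes possible, here is a minimal Python sketch. The tiny word list and the tag names (DET, ADJ, NOUN, VERB, ADV) are invented for the example; real corpora use their own, much more detailed tagsets.

```python
# Toy tagged "corpus": each token is a (word, tag) pair.
# The tag names are illustrative assumptions, not the tagset of any real corpus.
TAGGED = [
    ("the", "DET"), ("good", "ADJ"), ("news", "NOUN"),
    ("travels", "VERB"), ("fast", "ADV"),
    ("the", "DET"), ("fake", "ADJ"), ("news", "NOUN"),
    ("spreads", "VERB"), ("faster", "ADV"),
]


def words_with_tag(tagged_tokens, tag):
    """Return every word carrying the given part-of-speech tag, in text order."""
    return [word for word, t in tagged_tokens if t == tag]


print(words_with_tag(TAGGED, "ADJ"))  # ['good', 'fake']
```

Because every word carries its tag, a search can ask for "any adjective" or "any verb" rather than for one particular word.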

Why use corpora?

Now, why would you use corpora? Here are some reasons – I mean, apart from the fact that they are actually really intriguing and fun for language nerds:

  • Dictionaries are really useful, but they cannot necessarily compare usage or tell you much about collocations (groupings of certain words). So, say you were editing or working on a text in a particular technical field, such as renewable energy: a dictionary cannot tell you which verbs are used with, say, ‘solar panels’, such as ‘install’, although that is a fairly obvious one. It cannot tell you what sort of adjectives are used with particular nouns. Dictionaries can tell you combinations of verbs and prepositions, but that is about it. So, if you want to find collocations, corpora are the way to go.
  • Sometimes you have a phrase in front of you and you think, ‘this is not quite what I want to say’, ‘this is not quite the adjective I want’, and, yes, you can look for synonyms and you can go to a thesaurus, but you can also go to a corpus and, for instance, find a whole list of adjectives that are used with a particular noun. Depending on which corpus you use, you can even specify whether you want to search by positive or negative adjectives. Sometimes all that you need is a nudge, because you know that you are after something but you just cannot think of it and a corpus might tell you the information, or at least it will give you a whole range of relevant information that you can choose from.
  • And finally, a corpus allows unique searches – and it is science; it is data. Corpora are data from real, natural language and sometimes it is just good to be able to back up your choices, whether as a translator or an editor, with proper data.

How to use corpora

Common search types in Google Books Ngram Viewer

[Claudia referred to sample research questions that were displayed for discussion in the meeting.]

So, the first question that I had put up was ‘What is more common [in written language], ‘good news’ or ‘bad news’, and what about ‘fake news’?’ and, for those of you who were here before we officially started, everybody said ‘good news’ and, yes, it is about twice as popular as ‘bad news’. And, if you look at this graph [referring to a slide], this is from the Google [Books] Ngram Viewer, which is one free corpus tool, and it just gives you the frequencies of words or phrases. You can specify the period of time. The corpus that Ngram Viewer uses is Google Books and unfortunately it only goes to about 2012, although I have a feeling that even after 2008 the data seems to be skewed. That is why, on this graph, we hardly see any ‘fake news’ because in 2008 fake news actually was not a thing – it is a very recent development, as we probably all know.

So, that is a fairly basic example, but you could ask, for instance, ‘Which is more commonly used, ‘in future’ or ‘in the future’?’ This will be a very quick and easy search – you just separate the search terms by a comma, and I specified case insensitive, and [select] the smoothing – the higher the smoothing number in the dropdown menu, the more even the lines are; if you do no smoothing, you get a lot of jagged lines. When I searched for ‘in future’ and ‘in the future’, ‘in the future’ turned out to be much more common these days, but that was not always the case. Up until the mid- or late-19th century, ‘in future’ was the more common one, and in British English the version with the article became dominant in 1880; in American English that was in about 1865 – just a titbit of information for language nerds, how language changes differently in different countries.
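
To show why a higher smoothing number gives more even lines, here is a minimal Python sketch of a centred moving average. It assumes that a smoothing value of n means averaging each year with up to n years on either side; the yearly figures are made up for the example and are not real Ngram data.

```python
def smooth(values, window):
    """Centred moving average: each point is averaged with up to `window`
    neighbours on either side (fewer at the two ends of the series)."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - window)
        hi = min(len(values), i + window + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


raw = [2.0, 8.0, 2.0, 8.0, 2.0]  # invented yearly frequencies, very jagged
print(smooth(raw, 0))  # [2.0, 8.0, 2.0, 8.0, 2.0]  (no smoothing)
print(smooth(raw, 1))  # [5.0, 4.0, 6.0, 4.0, 5.0]  (peaks flattened)
```

With no smoothing the line jumps between 2 and 8; with a window of 1 it stays between 4 and 6, which is the flattening effect described above.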

So, the Ngram Viewer has the huge corpus of Google Books. Personally I only use it for very basic searches because, once you want to do more complex searches, it is not that intuitive and it is not as useful; if you want to be specific you need to learn a lot about the syntax of the search queries. The other corpus that I am going to be talking about is much easier so I tend to use that one. Google Books corpora are available for a number of languages, so you can specify American English, British English, and there are also corpora for French, German, Hebrew, Italian, Russian and Spanish if that is of interest to anybody.

Common search types in English-Corpora.org

Now, the second question that I put up was ‘‘Unprecedented’ versus ‘unparalleled’: are there any differences between the two words?’ This is a search in the iWeb corpus at English-Corpora.org, a free corpus collection online that offers a number of corpora. You can access iWeb and the British National Corpus (BNC) through it, as well as the corpus of American English – Corpus of Contemporary American English (COCA) – and a number of other English language corpora. Unfortunately, the BNC is not a free corpus, so that is why the BNC data in English-Corpora.org are only available up to about 2000 (I think).

So, what I did here was, I just entered that I wanted to compare two words. You could also use two phrases. And it will give me the collocates – what the word ‘unprecedented’ is most commonly used with and what ‘unparalleled’ is most commonly used with – so I can see where there is a difference in their collocates and, as you see [on the slide], there is a lot of overlap in the first (the most common) collocates – which are ‘time’, ‘year’, ‘period’, ‘world’, ‘years’ – and the different colour markings that you see – the very bright green and the red – are where there is the biggest difference. So ‘system’ is only, or mainly, used with ‘unprecedented’; and ‘experience’, ‘quality’ and ‘feat’ are mainly used with ‘unparalleled’. So, it was said [in the discussion before the meeting], that ‘unparalleled’ might be more positive, so this would confirm the hunch that there is only a minor difference, but that ‘unparalleled’ tends to be more commonly used with positive nouns.

Then, the third question goes back to the news. These are adjectives used with the noun ‘news’, and if you look at the top [of the graph] it runs from 1990 to 2004 and then from 2005 to 2019 – so two periods of about 15 years each – and the question was ‘If you see adjectives such as ‘Soviet’, ‘Iraqi’, [‘worldwide’] and ‘honoured’ [modifying ‘news’ in one period] versus ‘Syrian’, ‘Korean’, [‘provocative’] and ‘digital’ [modifying ‘news’ in another period], where would you find ‘fake’?’ Yes, it is in the second group with ‘digital’, the more recent group. The first one relates to the period up to 2004 and the second to the period from 2005 until now. In the period up to 2004, ‘fake’ did not exist as a modifier of ‘news’ and, as you can see [on the slide], I have grouped them by the number of occurrences and ‘fake’ actually rose right to the top in the last group. You cannot see it in this extract, but the search gives you the 100 most common adjectives; that was too big to show, so I have just chosen the first 20 so they are still legible.

Right at the bottom of that top-100 list, and with a huge drop, were modifiers such as ‘mainstream’, ‘huge’, ‘excellent’ and ‘exciting’, which suggests to me that, in general, the regard in which news is held has dropped. So, negative modifiers have been on the rise and positive ones have been on the way out, which is maybe not of practical use, but it is an interesting observation. Similarly, if I were to write a text as a translator, or if I were to edit a text, and wanted to look for an adjective to use with the noun ‘news’, I would have a fairly good set of options among the ones here, especially if I go on to the first 100. I could also set the search to include even more results than 100. And if anybody noticed the modifier ‘nerdist’ in position 15 on the right-hand side, I was really intrigued, but Nerdist News is actually a website that delivers entertainment news – I had to find out what it was. If you click on any of these adjectives, it will take you to a number of sentences where [that modifier] is used. Say, if I click on ‘nerdist’, it would then take me to a window where I get about 100 occurrences of ‘nerdist news’. In a different type of search, I could also click a number of adjectives and get all the keywords in context displayed.

So, let us go to a different type of search. This is actually the start page of English-Corpora.org [referring to a slide] and the default corpus it uses is the Corpus of Contemporary American English. [The website can be used a number of times for free, and you also get prompts to register.] You have to register, even for free access. I have registered and just last week decided I was going to pay for this – it is $30 a year for the subscription and I thought I do actually appreciate that this resource is available. This [referring to a slide] is the interface that you see – it does not change whether you have a paid or an unpaid account.

These are the different search types:

  • The collocate search is a very useful one that I have already talked about. PoS here stands for ‘part of speech’. If you look at the search text box [referring to the slide], I have got ‘news’ in there as the word or phrase that I am interested in, and that was the search that I used to compare the adjectives used previously. Then I enter into the collocates box that I want to know what sort of adjectives go with the noun ‘news’. You do not have to know [the search syntax] that ‘J*’ stands for ‘adjectives’; it will give you the option if you click in the box. As I want the collocate to be an adjective, and because I know that, in English, adjectives are generally to the left of the noun, I can use the numbers underneath (0, 1, 2, 3, 4 and +) to specify how many words to the left of the noun the search should look for that adjective. So, I have used just ‘1’, as I want only the adjective that immediately precedes the word ‘news’. You could also, if you wanted a preposition to go with a verb, specify the part of speech preposition and you would probably specify something to the right of the word. That, again, usefully narrows down your search. And then, at the bottom, there are sections (that are optional): you can specify a date range (only available for the American English corpus), and you can specify certain genres (academic English, journalism, and so on). There are also different options for the section searches, such as newspapers, magazines and so on.
  • Another option, next to sections, is text/virtual. That is if you want to create your own corpus, you can actually do that in iWeb. For instance, as a translator – or an editor – you want to work in a new technical field and want to use the corpora to get a feeling for the language that is used in that field. If you decide to create a virtual corpus, you can then enter a search word or a range of search words and iWeb will offer you lots and lots of webpages – websites – from which to extract text and you just click on the ones you want. You can also specify the genre and then iWeb collects all these texts, extracts them, marks them up – I have no idea how – and presents them to you as a corpus that you can do the same sort of searches in, for collocates or phrasal verbs, and so on, as you would in the other corpus. I think that is a really useful tool. For instance, you could also – something that I have done as a translator in a new field – search for keywords. Keywords are words that are relatively uncommon, so they tend to be specialist words. So if I create a corpus for the search word, say ‘tractor’ or ‘agricultural machinery’ (which is a field that I happen to work in sometimes), and then extract the rare words in there, I get a very good collection of specialist terms in that field. It might not be complete, but it is a good start. So that is another way you can use a corpus, just to get a feel for technical terminology in a particular area.
  • [With] sort/limit, to the right of texts and virtual, you can decide if you want the results to be sorted by relevance (so in the comparison that would be where the biggest shifts have been), by frequency (how often a word or an expression occurs in the whole corpus) or by alphabetical order. And in the options, you can limit how many results you want to have displayed, and you can decide what you want to group the results by and set various limits.
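
The collocate search described in the first bullet can be imitated on a very small scale. The following Python sketch counts the adjectives found within a chosen number of words to the left of ‘news’. The sample tokens and the tag names (ADJ, NOUN, VERB, CONJ) are invented for illustration, and the function name is hypothetical; this is not how English-Corpora.org itself is implemented.

```python
from collections import Counter


def left_collocates(tagged_tokens, node, tag, span=1):
    """Count words with the given tag found within `span` tokens
    to the left of each occurrence of `node`."""
    counts = Counter()
    for i, (word, _) in enumerate(tagged_tokens):
        if word.lower() != node:
            continue
        # Look only at the `span` tokens immediately before the node word.
        for w, t in tagged_tokens[max(0, i - span):i]:
            if t == tag:
                counts[w.lower()] += 1
    return counts


# Invented sample; tag names are assumptions, not a real corpus tagset.
SAMPLE = [
    ("Fake", "ADJ"), ("news", "NOUN"), ("spreads", "VERB"),
    ("while", "CONJ"), ("good", "ADJ"), ("news", "NOUN"),
    ("waits", "VERB"), ("and", "CONJ"), ("fake", "ADJ"),
    ("news", "NOUN"), ("wins", "VERB"),
]

result = left_collocates(SAMPLE, "news", "ADJ", span=1)
print(result.most_common())  # [('fake', 2), ('good', 1)]
```

Setting span to 1 corresponds to choosing ‘1’ on the left in the search form: only the word immediately before ‘news’ is considered.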

As I said at the beginning, dictionaries give some answers, but different from a corpus, and the iWeb corpus also has another function in the way it displays information about a word. I looked up the Oxford Learner’s Dictionary to see what it had to say about ‘interaction’ and it gave two definitions that are fairly obvious [referring to a slide showing two definitions using the collocates ‘between’ and ‘with’]. If I then go on the iWeb corpus word search page, the information that you get includes grammatical information: you see that the word ‘interaction’ is a noun and where it is most commonly found (it is common in blogs, on the web and, above all, in academic English); you get a definition; you can hear it pronounced, get translations (I have not tried that), and get synonyms and topics where ‘interaction’ often appears, and you can click on any of the topics and then find clusters of information about these words. You immediately get a selection of collocates that are commonly found with ‘interaction’. And then you see the so-called clusters (grammatical correlations that ‘interaction’ is often found with) and also displayed are concordance lines – and there are at least 50 or maybe 100 concordance lines where we see how ‘interaction’ or ‘interactions’ is used in texts. At the bottom of these concordance lines, to the left, you can see the sentence, and you can expand that and see where and what sort of text it comes from. It is academic English, the first one from a journal called Physics Today and the second one is from a magazine called Parenting. So you get a whole wealth of information about that particular word and, again, sometimes when you are stuck, it frees up your mind to go off in different directions that might just give you that inspiration that you need to find what you were after.

More searches

So, this is a quick overview of other searches that you can do that I have not gone into in detail here:

  • You can do a list search, where you just display how common a word is. Here I have done a search for the synonyms of ‘beautiful’. The results are the most common adjectives that express the concept of ‘beautiful’: ‘lovely’, ‘attractive’, ‘striking’. It takes about five seconds to do the search and you get additional information that you would not get from a thesaurus or a dictionary.
  • You can search for expressions. For instance, if I enter ‘more’ and an asterisk and then ‘than’ – the asterisk stands for one word – I find all expressions that have ‘more’ and one other word and then ‘than’: so, ‘more important than’, ‘more frequent than’, ‘more common than’, ‘more unusual than’, and so on.
  • Or I can create lists of grammatically or thematically related words.
  • The find random words search is a fun search – it is one of the options in browse from the first screen. You can search for different word forms, but you can also search for rhymes, or you can search for words that have a certain number of syllables or that are stressed on the third syllable. So, very unusual searches where, again, you would struggle to find that information in a dictionary.
  • And the KWIC search, keyword in context, where you see the word that you are searching for and you see the sentence around it, as well, and you see where it comes from.
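
Two of the searches in this list, the ‘more * than’ expression search and the keyword-in-context display, can be sketched in a few lines of Python using regular expressions. The sample text and function names are invented for the example, and a real corpus tool works on far more data and offers many more options.

```python
import re

# Invented sample text for illustration only.
TEXT = ("Clarity is more important than speed. Errors are more common than "
        "you think, and more common than we admit.")


def wildcard_matches(text, first, last):
    """Find 'first <exactly one word> last' and return the middle words."""
    pattern = rf"\b{re.escape(first)}\s+(\w+)\s+{re.escape(last)}\b"
    return re.findall(pattern, text, flags=re.IGNORECASE)


def kwic(text, keyword, width=15):
    """Return keyword-in-context lines: the keyword in brackets with up to
    `width` characters of text on either side."""
    lines = []
    for m in re.finditer(rf"\b{re.escape(keyword)}\b", text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}}[{m.group()}]{right}")
    return lines


print(wildcard_matches(TEXT, "more", "than"))  # ['important', 'common', 'common']
for line in kwic(TEXT, "common"):
    print(line)
```

The first function plays the role of the asterisk standing for one word; the second lines up each hit with its surrounding text so that the contexts can be compared at a glance.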

Snares and benefits of corpora

So, those were all the fun things about corpora, but they can be quite time-consuming and, above all, I find corpus searching a great distraction because I always think ‘What happens if I do this? … What happens if I try that? … Oh, look at that, is that not interesting!’ and half an hour later I have learnt all sorts of interesting language titbits but maybe not really got on with my work.

Something to be aware of is that corpus data can be skewed or obsolete. For instance, different corpora have data from varying dates. So, while you can compare the data, you might find results that do not actually reflect the desired or current situation. Or, if you go to different corpora in different languages, for instance, or from other sources, make sure that you find out where, and when, the data came from. Check the source – is it an evenly spread-out source, or if, say, you want academic English, where does the data come from? If you want a generic corpus, make sure that it is not one that takes text just from magazines, for instance. So, be aware of where the texts are taken from in the corpus.

But I think corpora are just a good tool to have for certain searches. It is good to be aware of them so you can pull them out of your toolbox whenever you know that is a question that you cannot answer anywhere else and you are happy to spend a bit of time – especially when you are not used to corpus searches. So, it is just a useful thing to have in your toolkit, and I hope that you will agree with me and have found this talk a useful insight.

Resources: free corpora

The English-Corpora.org site contains the iWeb (a massive corpus of online text with 14 billion words), the Corpus of Contemporary American English (COCA), the British National Corpus (BNC) (data only to 1993), and several other corpora on different subjects.

Some are downloadable and others are online access. You can choose which ones would best suit your searches. There is a very helpful overview (mainly about iWeb) at corpus.byu.edu/iweb/help/iweb_overview.pdf.

Another free resource is the Google Books Ngram Viewer, based on the Google Books corpus, that I briefly mentioned at the beginning. That gets fairly complex, but also has a reasonably helpful explanation of the searches that you can do.

And then the Tools for Corpus Linguistics site has a long list of free and low-cost corpus tools and concordance tools, if you want to experiment with any of them.

Additional discussion

Q. Do you think there are more ‘activists’ or ‘environmentalists’ in written language? Have a guess when these words started coming into common usage and when they had their heyday, between 1900 and 2008.

A. From the meeting group: ‘activist’, last millennium, and ‘environmentalist’, this millennium.

CK-McQ: That is actually a very good answer. ‘Activist’ is much more common and also much broader – you can be an activist on all sorts of things, including environmentalism. ‘Activist’ has had a pretty stellar career since the mid- to late-1960s and ‘environmentalist’ only got going in about the mid-1970s, but then only really took off in the last 10 years.

Q. More on the usage of ‘good news’, ‘bad news’ and ‘fake news’:

A. One thing I wanted to show you: [in this slide, you see that] in the mid-1940s both the ‘good news’ and the ‘bad news’ have a bit of an uptick, but the ‘good news’ is actually just a bit later, so that would again correlate with world events at that time – although I find it interesting that all through World War I there is not much movement compared to World War II. I find it really intriguing that the ‘good news’ went down slightly, but the ‘bad news’ was pretty steady in World War I.

Q. What are people ‘happy’ about in American and British English? Guess who is more likely to be ‘happy’ about something generally and whether there is likely to be a minor or major difference between American English and British English?

A. From the meeting group: As an English person I would guess that the British are more likely to be ‘pleased’ than ‘happy’ because we do not like to be too effusive. We might be ‘mildly energised’.

CK-McQ: That would be an interesting search to do, to include ‘pleased’. Yes, you are right. Americans write a lot more about – and this was from written English – they write more about being ‘happy’ but then, of course, there are a lot more Americans than there are British English speakers. That is also something to take into account when you look at this sort of data. And the only thing British English was commonly ‘happy’ about is ‘a way’ of something, followed by ‘situations’, ‘ideas’ and ‘arrangements’. The first person to make an appearance in the list was ‘wife’ in place 7. And ‘children’, interestingly enough, were way down the list in both American and British English. And American English is ‘happy’ about ‘facts’, ‘situations’, ‘decisions’ and ‘ways’, but also about ‘news’, ‘things’ and ‘ideas’, and the first person in American English to make an appearance is a ‘baby’, in place 8. Again, I am not sure what it tells you, if anything. You probably need a lot more information, but it is just intriguing sometimes to look at these differences.
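
The point about there being many more American than British writers is usually handled by comparing relative rather than raw frequencies. Here is a minimal Python sketch of that arithmetic, expressing hits per million words. All the figures are invented purely for illustration and are not real corpus counts.

```python
def per_million(hits, corpus_size):
    """Convert a raw hit count into occurrences per million words."""
    return hits * 1_000_000 / corpus_size


# Invented figures for illustration only; not real corpus counts.
big = per_million(hits=50_000, corpus_size=1_000_000_000)
small = per_million(hits=6_000, corpus_size=100_000_000)

print(big)    # 50.0 per million in the larger corpus
print(small)  # 60.0 per million in the smaller corpus
```

In this made-up case the larger corpus has far more hits in absolute terms, yet the word is relatively more frequent in the smaller one, which is why corpus size matters when comparing varieties.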

I went to a presentation not long ago about corpora of different Englishes – world Englishes. Some information that I found really interesting and intriguing was about the use of swear words in different Englishes. That presenter divided Englishes as the core (which is American and British English), then the second circle (which was Australian, Canadian, Indian – the colonial Englishes), and then users further away from the native speakers (what you would call ‘world English’, in China and Europe) and the further you are away from the centre, the more reluctant people are to use informal language and swear words. So, the native speakers are the most informal with their use of language; whereas, for instance, with Indian English, people are very reluctant to use informal English or swear words. If you are working in this field – for instance, if you had a novel where you had some speakers of different world Englishes – that is very useful information to have. What sort of English is used by Jamaican English speakers, or Kenyan English speakers?

Q. Could ‘in future’ and ‘in the future’ evoke different meanings?

A. Possibly. I did not do a search for that. If I did a search for these two phrases, and did a compare search, I would then see the frequency of each of these phrases. If I then clicked on the display keyword in context I would get a whole range of sentences that have either ‘in future’ or ‘in the future’. I could then look through them and see whether there is any difference in the context or the meaning, or in the genre or in the source, because for each of the sentences I would see where it comes from – maybe not the individual publication – but whether it is academic text, journalism text, whether it is news, whether it is a novel, and so on and so forth. So, yes, you could find out more about it. The iWeb corpus has a reasonably good help system and one feature I like about the iWeb corpus, apart from the fact that it is free and up-to-date and huge, is that they have sample searches where you just click on the sample and see what happens, so that has often been useful.

Q. Is there a difference between ‘corpora’ and ‘concordances’, such as concordancing software, and how editors could use them?

A. Concordance searches are a part of corpus searches and there are concordance tools around. [I can suggest] a web page mentioned earlier, full of links to different language analysis tools, most of them free or very low cost, and there are some concordance tools there: https://corpus-analysis.com/. You would need a corpus [on which] to do a concordance search because you have to have a collection of texts. I am not sure whether, in a concordance search, you have the part of speech-tagging facility that allows you to specify ‘I want an adjective to go with this noun’, or ‘I want to know the preposition to go with this verb’, and so on. So there is some overlap, but I do not think they are the same.

Q. Are there Australian sites equivalent to the US one demonstrated here or sites that you can search using Australian English?

A. There is the Australian National Corpus (AusNC). It is not available for free, so that is why I have not included it here. You could probably create your own corpus for a particular area. I am not quite sure what the access conditions are for the Australian National Corpus – whether it is a paid subscription or whether you need to be a member. Maybe IPEd could have access to it – I am not sure – but I am not aware of a free resource for Australian English, unfortunately.

Q. I work with British English a lot. When I have tried to use Google Ngram, British English 2012 corpus, it brings up a lot of US English results and I can see from the book publication details that they are US publications. It’s basically useless. Do you have any experience of this or know why this happens?

A. I do not have a lot of experience with the Google Ngram Viewer because I find the searches are quite limited – and that is why I tend to use the iWeb corpus, because I think it is more specific. The only unfortunate thing is that the British English is quite old in it. A lot of the Google Books corpus that Ngram Viewer uses is based on optical character recognition, so I think it is less specific and it is much more difficult to narrow it down because it just uses the huge number of words in Google Books and puts them all together. So, I think it is useful for basic searches, quick searches. You might be better off doing more specific Google searches. For instance, if you put a search phrase into quotation marks it will search for that exact phrase; and then if you add at the end ‘site:.uk’ (with no space after the colon), it will only search in sites that have a UK URL. That might be more specific than using the Ngram Viewer. Or you can use ‘site:[company URL]’ to search in a particular company’s site, if that is useful for you, perhaps to see if a company uses one word rather than another in a specific context. You could look at how often a particular phrase or word comes up on that company’s site. That might be a better option and it is also a type of corpus search that is very quick and easy to do. It is probably less scientific and you cannot do the detailed searches by parts of speech, but if you are just after a particular phrase or expression or word, then the ‘site’ command in Google is very useful.
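
For anyone who runs such searches often, the exact-phrase and ‘site’ combination can be put together with a short Python sketch. The function names are hypothetical, and the sketch only builds the query text and a search address; it does not fetch or count any results. Note that the operator is written without a space after the colon.

```python
from urllib.parse import urlencode


def site_query(phrase, site):
    """Build an exact-phrase query restricted to a site or domain ending."""
    return f'"{phrase}" site:{site}'


def search_url(query):
    """URL-encode the query into a Google search address."""
    return "https://www.google.com/search?" + urlencode({"q": query})


q = site_query("in future", ".uk")
print(q)              # "in future" site:.uk
print(search_url(q))  # https://www.google.com/search?q=%22in+future%22+site%3A.uk
```

Swapping ".uk" for a company’s domain name gives the company-specific search mentioned above.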

Q. You can do a corpus linguistics course through a MOOC [massive open online course] run through Lancaster University – Corpus Linguistics: method, analysis, interpretation.

A. Leeds University also has some free corpora available online and yes, a MOOC course would be great if you have time at the moment … It is always fun to talk about language.