The TF*IDF Algorithm Explained

TF-IDF (term frequency-inverse document frequency) is an information retrieval technique that helps find the most relevant documents corresponding to a given query.

TF measures how often a term appears in a document, and IDF measures how rare, and therefore how informative, that term is across a collection of documents. Multiplying these two scores gives the TF-IDF score.

Google has been using TF-IDF (also written TF IDF, TF*IDF, TFIDF, or TF.IDF) to rank content for a long time. It seems that Google focuses more on weighted term frequency than on raw keyword counts.

So, it’s important to wrap your head around the TF-IDF algorithm and how it works.

Whether you’re a content writer or an SEO expert, this article will guide you through the unfamiliar topic of TF-IDF.

By understanding how Google utilizes this algorithm, content writers can reverse-engineer TF-IDF and optimize content for both users and search engines.

And SEOs can use it as a tool for hunting keywords with a high search volume and comparatively low competition.

Keep in mind that TF-IDF only covers one facet of content optimization, which itself is only a part of SEO. Contact us for a thorough technical SEO audit of your website.

TF-IDF simple explanation

Search engines use TF-IDF to better understand content that would otherwise be undervalued. For example, when you search for “Coke” on Google, Google may use TF-IDF to figure out if a page titled “COKE” is about:

  a) Coca-Cola.
  b) Cocaine.
  c) A solid, carbon-rich residue derived from the distillation of crude oil.
  d) A county in Texas.

What is TF-IDF?

The TF-IDF algorithm is used to weight a keyword in any content and assign importance to that keyword based on the number of times it appears in the document. More importantly, it checks how common the keyword is across the whole collection of documents (here, the web), which is referred to as the corpus.

For a term t in document d, the weight W_{t,d} of term t in document d is given by:

W_{t,d} = TF_{t,d} \times \log(N / DF_t)

Where:

  • TF_{t,d} is the number of occurrences of t in document d.
  • DF_t is the number of documents containing the term t.
  • N is the total number of documents in the corpus.
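The formula above can be written as a small Python function. This is only a sketch: the function name tfidf_weight and the toy numbers are made up for illustration, and the base-10 logarithm is an assumption (the formula does not name a base, but base 10 is the one that reproduces the 1.52 in the IDF example later in this article).

```python
import math


def tfidf_weight(tf, df, n_docs):
    """Return W = TF * log10(N / DF) for one term in one document.

    tf     -- occurrences of the term in the document (TF)
    df     -- number of documents that contain the term (DF)
    n_docs -- total number of documents in the corpus (N)
    """
    if df <= 0 or df > n_docs:
        raise ValueError("df must be between 1 and n_docs")
    return tf * math.log10(n_docs / df)


# Toy numbers (hypothetical): the term occurs 3 times in the document
# and appears in 10 of the 1,000 documents in the corpus.
weight = tfidf_weight(tf=3, df=10, n_docs=1000)
print(weight)
```

A term that appears in every document gets a weight of 0, because log10(N / N) is 0, no matter how often it occurs in the document.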

All right. Don’t panic if you feel a headache coming on.

Let’s define this more concretely.

How is the TF-IDF score calculated?

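A TF-IDF score starts at 0 and has no fixed upper limit: it is 0 for a term that appears in every document, and it grows as the term becomes rarer in the corpus or more frequent in the document. The higher the numerical weight value, the rarer the term. The smaller the weight, the more common the term.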

Let’s look at an example of a TF-IDF calculation. 

TF (term frequency) example

The TF (term frequency) of a word is how often it appears in a document. In this example, we divide the number of times the word appears by the total number of words in the document, so that long and short documents can be compared. When you know TF, you’re able to see if you’re using a term too much or too little.

When a 100-word document contains the term “cat” 12 times, the TF for the word “cat” is:

TF(cat) = 12/100 = 0.12
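To see the same calculation on actual text, here is a minimal Python sketch. The term_frequency helper and its simple lowercase word-splitting rule are assumptions made for this example, and the 100-word "document" is synthetic.

```python
import re


def term_frequency(text, term):
    """Return the share of words in `text` that are exactly `term`."""
    # Very simple tokenizer (an assumption): lowercase runs of letters/digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)


# Synthetic 100-word document: "cat" 12 times plus 88 filler words.
document = " ".join(["cat"] * 12 + ["dog"] * 88)
tf_cat = term_frequency(document, "cat")
print(tf_cat)  # 0.12
```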

IDF (inverse document frequency) example

The IDF (inverse document frequency) of a word is the measure of how significant that term is in the whole corpus (a body of documents).

Let’s say the size of the corpus is 10 million documents (10,000,000). If we assume there are 0.3 million documents (300,000) that contain the term “cat”, then the IDF, i.e. log(N/DF), is the base-10 logarithm of the total number of documents (10,000,000) divided by the number of documents containing the term “cat” (300,000).

IDF(cat) = log10(10,000,000/300,000) = 1.52
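The IDF step can be checked in Python. This is a sketch under the same assumptions as the example above (base-10 logarithm, no smoothing); the function name and the second, rarer term are hypothetical and only there for comparison.

```python
import math


def inverse_document_frequency(n_docs, docs_with_term):
    """Return log10(N / DF): higher for terms found in fewer documents."""
    return math.log10(n_docs / docs_with_term)


# Numbers from the example: 10,000,000 documents, 300,000 contain "cat".
idf_cat = inverse_document_frequency(10_000_000, 300_000)
print(round(idf_cat, 2))  # 1.52

# Hypothetical rarer term found in only 100 documents of the same corpus.
idf_rare = inverse_document_frequency(10_000_000, 100)
print(round(idf_rare, 2))  # 5.0
```

The rarer term gets the larger IDF, which is the behavior the definition describes.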

TF-IDF Calculation

Put the TF and IDF calculations together to get a TF-IDF score.

∴ W(cat) = TF(cat) * IDF(cat) = 0.12 * 1.52 = 0.182

A TF-IDF score of 0.182 is low. The IDF of 1.52 is small because “cat” appears in 3% of all documents, which suggests that “cat” is a common term with less weight.
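To put that number in context, the sketch below recomputes the "cat" score and compares it with a hypothetical rarer term that has the same TF but appears in only 1,000 documents. The rarer term and its document count are invented for the comparison; the base-10 logarithm matches the calculation above.

```python
import math

N_DOCS = 10_000_000  # corpus size from the example

# "cat": 12 occurrences in a 100-word document, found in 300,000 documents.
tf_cat = 12 / 100
w_cat = tf_cat * math.log10(N_DOCS / 300_000)

# Hypothetical rarer term: same TF, but found in only 1,000 documents.
tf_rare = 12 / 100
w_rare = tf_rare * math.log10(N_DOCS / 1_000)

print(round(w_cat, 3))   # 0.183
print(round(w_rare, 3))  # 0.48
```

Multiplying the unrounded values gives about 0.183; the 0.182 above comes from rounding the IDF to 1.52 before multiplying. With the same TF, the rarer term scores more than twice as high as "cat".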

Now that you have this figured out (right?), let’s look at how this can benefit you.

How you can benefit from using TF-IDF

Gather words. Write your content. Run a TF-IDF report for your words and get their weights. 

Compare all the terms with high TF-IDF weights with respect to their search volumes on the web. Select those with higher search volumes and lower competition. Work smart.
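If you are curious what a TF-IDF report boils down to, here is a minimal Python sketch. It is not how any SEO tool works internally: the three-sentence corpus, the tokenize helper, and the tfidf_report function are all made up for illustration, TF is divided by document length, and the logarithm is base 10 as in the calculation earlier in this article. Search volume and competition are not part of TF-IDF and would have to come from a separate keyword tool.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_report(corpus, target_index):
    """Return (term, weight) pairs for one document, highest weight first."""
    doc_tokens = [tokenize(doc) for doc in corpus]
    n_docs = len(doc_tokens)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))

    target = doc_tokens[target_index]
    counts = Counter(target)
    weights = {
        term: (count / len(target)) * math.log10(n_docs / df[term])
        for term, count in counts.items()
    }
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


# Toy corpus (hypothetical); the report is for the first document.
corpus = [
    "the cat sat on the mat",
    "the dog chased the ball",
    "the cat and the dog",
]
report = tfidf_report(corpus, target_index=0)
for term, weight in report:
    print(f"{term:<5} {weight:.3f}")
```

In this toy corpus, "the" appears in every document and ends up with a weight of 0, while words unique to the first document ("sat", "on", "mat") rank highest.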

A good rule of thumb is that the more your content “makes sense” to the user, the more weight the search engine is likely to assign it. With words that have a high TF-IDF weight in your content, your content stands a better chance of appearing among the top search results, so you can:

  • stop worrying about using the stop-words,
  • successfully hunt words with higher search volumes and lower competition,
  • be sure to have words that make your content unique and relevant to the user, etc.

Let’s get you up and rolling with TF-IDF optimization

1. Register your account at Ryte.

First, you will need to sign up for Ryte.com. Ryte lets you optimize your content using TF-IDF! Check out the free 10-day trial! With a very simple user interface, it is one of the best options for you or your content writer. Your content writer can create their own account and start work within minutes.

You will see Start your Trial today in the middle of the screen. Select this to start signing up. 

[Screenshot 1]

You can either sign up with Google or your email.

[Screenshot 2]

Confirm your email address, and you’ll get to an organization information page.

[Screenshot 3]

Complete all four steps to finish signing up. Then you’re ready to get started on analysis!

2. Click on Content Success on the left side, Analyze, and then New Analysis.

And then, you will come to this page:

[Screenshot 4]

Once you are there, you simply need to choose your keyword, the language, region, and country you are interested in, and click Get keyword recommendations.

After some time, you will be presented with keyword recommendations. In this case, we put in the keyword “tf-idf” for English in the United States:

[Screenshot 5]

If you scroll down, you can see the keyword relevancy percentage.

[Screenshot 6]

As you can see, this is a lot of raw data that would be hard to use. Fortunately, we can do cool stuff to make it easier to understand and utilize.

Your options include Detail mode, Competition, and Compare.

And the last one is what we want to focus on. It’s highlighted in red in the following screenshot.

[Screenshot 7]

When you click Compare, you’ll see an Add URL button.

3. Add URL is where all the magic happens. Let’s check it out in more detail.

We are going to compare the results with this article: https://www.onely.com/blog/what-is-tf-idf/ (apologies for getting all meta with this thing!).

[Screenshot 8]

Ryte will show you the most relevant and popular keywords that you can integrate into your article. 

[Screenshot 9]

If you want a percentage breakdown, just scroll down to Keyword relevancy.

But there’s more! There’s also an option to filter the keywords (highlighted in red).

[Screenshot 10]

You can filter for Report type, Word type, and Keyword popularity. Also, if you need, you can Ignore a keyword.

[Screenshot 11]

Now that you know how to filter the data, let’s finally start optimizing our content.

Comparing and optimizing our content

Using the information above, you can determine the keywords not present in your content that are closely related to the topic. And adding those keywords to your content will improve the topic relevancy and help your page rank better.
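The comparison itself is simple enough to sketch in Python. This is only an illustration of the idea, not Ryte's implementation or API: the missing_terms function, the sample text, and the list of recommended keywords are all hypothetical.

```python
import re


def normalize(text):
    """Lowercase the text and reduce it to words separated by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def missing_terms(content, recommended_terms):
    """Return the recommended terms that never occur in `content`."""
    padded = f" {normalize(content)} "
    return [
        term for term in recommended_terms
        if f" {normalize(term)} " not in padded
    ]


# Hypothetical inputs: a sentence of your content and a keyword list
# copied by hand from a TF-IDF report.
content = "TF-IDF multiplies term frequency by inverse document frequency."
recommended = ["tf-idf", "term frequency", "corpus", "keyword"]

gaps = missing_terms(content, recommended)
print(gaps)  # ['corpus', 'keyword']
```

The terms it returns are candidates to work into your text where they fit naturally.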

How can you edit your content now to improve your TF-IDF measured relevancy? The easiest way to do that is to go to Optimize and click Content Editor.

[Screenshot 12]

Once there, all you have to do is paste your content and click Get keyword recommendations. It’ll prompt you for a focus keyword, and then voilà:

[Final screenshot]

Now it all gets extremely easy. All you have to do is edit your content until you are happy with it.

And that should be more than enough to get you started. See, that wasn’t so bad, was it?

Good luck, and happy optimizing with TF-IDF!