Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It helps in identifying the most relevant words in a document by considering not just their frequency in a single document (Term Frequency), but also how unique the word is across all documents in the corpus (Inverse Document Frequency). TF-IDF is often used in information retrieval and text mining as a weighting factor during search and document similarity scoring processes.
Here's a breakdown of how it works:
### 1. **Term Frequency (TF)**
Term Frequency measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in a long document than in a short one. Thus, the term frequency is often divided by the document length (the total number of terms in the document) as a form of normalization:
$$\mathrm{TF}(t, d) = \frac{\text{number of times } t \text{ appears in } d}{\text{total number of terms in } d}$$
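A minimal sketch of normalized term frequency in Python (the whitespace-split tokenizer and the function name are illustrative assumptions, not part of any particular library):

```python
def term_frequency(term, document):
    """Count of `term` in `document`, normalized by document length."""
    tokens = document.lower().split()  # naive whitespace tokenizer
    return tokens.count(term.lower()) / len(tokens)

doc = "python is fun and python is popular"
print(term_frequency("python", doc))  # 2 occurrences / 7 tokens ≈ 0.2857
```

Real systems usually apply more careful tokenization (punctuation stripping, stemming), but the normalization step is the same.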
### 2. **Inverse Document Frequency (IDF)**
Inverse Document Frequency measures how important a term is within the whole corpus. The idea behind IDF is that terms that appear in many different documents are less significant than terms that appear in a smaller subset of documents. IDF is calculated by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient:
$$\mathrm{IDF}(t, D) = \log\!\left(\frac{\text{total number of documents in } D}{\text{number of documents containing } t}\right)$$
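The IDF computation can be sketched as follows. The base of the logarithm is a convention (base 10 here; natural log is also common), and returning 0 for a term absent from the corpus is an assumption to avoid division by zero:

```python
import math

def inverse_document_frequency(term, corpus):
    """log(N / df): N = corpus size, df = documents containing `term`."""
    n_docs = len(corpus)
    df = sum(1 for doc in corpus if term.lower() in doc.lower().split())
    return math.log10(n_docs / df) if df else 0.0  # df == 0: term not in corpus

corpus = [
    "python is popular",
    "java is popular",
    "python and java are languages",
]
print(inverse_document_frequency("python", corpus))  # log10(3/2) ≈ 0.176
```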
### Combining TF and IDF
The TF-IDF value is simply the multiplication of TF and IDF for a term in a document:
$$\mathrm{TFIDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)$$
This value is high for terms that are frequent within a given document but appear in only a small fraction of the documents in the corpus, giving them high discriminative power to distinguish between documents. Conversely, the TF-IDF value is low for terms that appear in most documents (low IDF) or that occur only rarely within the document itself (low TF).
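Putting the two factors together over a toy corpus (same illustrative assumptions as above: whitespace tokenization, base-10 log):

```python
import math

def tf_idf(term, document, corpus):
    """TF(t, d) * IDF(t, D) for a single term and document."""
    tokens = document.lower().split()
    tf = tokens.count(term.lower()) / len(tokens)
    df = sum(1 for doc in corpus if term.lower() in doc.lower().split())
    idf = math.log10(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
# "the" appears in 2 of 3 documents -> low IDF drags the score down,
# even though it occurs twice in the first document.
# "cat" appears in only 1 document -> higher IDF, higher score.
print(tf_idf("the", corpus[0], corpus))
print(tf_idf("cat", corpus[0], corpus))
```

Note how "cat" outscores "the" in the first document despite occurring fewer times: this is exactly the discriminative effect described above.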
### Practical Example
Consider a corpus with 1000 documents, and the word "Python" appears in 100 of these documents. Then, for a document that contains the word "Python" 3 times and has 100 words in total:
$$\mathrm{TF}(\text{"Python"}, d) = \frac{3}{100} = 0.03$$
$$\mathrm{IDF}(\text{"Python"}, D) = \log_{10}\!\left(\frac{1000}{100}\right) = 1$$
$$\mathrm{TFIDF} = 0.03 \times 1 = 0.03$$
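The arithmetic for this scenario can be checked directly (using a base-10 logarithm, an assumption since the log base is a convention):

```python
import math

tf = 3 / 100                   # "Python" appears 3 times in a 100-word document
idf = math.log10(1000 / 100)   # 1000 documents total, 100 contain "Python"
print(tf)        # 0.03
print(idf)       # 1.0
print(tf * idf)  # 0.03
```

With a natural log instead, the IDF would be ln(10) ≈ 2.30 and the score ≈ 0.069; the choice of base rescales all scores uniformly and does not change rankings.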
This score helps in ranking "Python" according to its relative importance in the document within the context of the given corpus. TF-IDF is widely used in document classification, search engines, and information retrieval to rank how relevant a document is to a query term or to identify the most relevant terms within documents.