Word Frequency Pooled Calculation
Below is a detailed overview of how Dedoose calculates the top 40 most frequent words in your selected dataset using a pooled calculation.
- A set of ‘Stop Words’ is excluded from all calculations. A ‘stop word’ list is a set of commonly used words, such as articles and prepositions, that carry little meaning on their own or out of context. These words are intentionally excluded from word frequency analysis so that more descriptive terms can emerge. The stop word list implemented in Dedoose is relatively broad and has proven most effective for analytic purposes.
- After stop words are excluded, the 40 most frequently occurring words are calculated separately for each document in the search corpus, and each word's count is added to a running frequency total for that word. This creates two collections:
  - A dictionary that tracks the total count of each unique word across all documents
  - A list that stores the individual word counts for each document
- Dedoose presents the final 40 words with the highest frequency across the entire dataset and/or search corpus. In other words, the words presented are those that occur frequently across the documents as a whole, not those that simply occur at a high rate within one document. This approach is intended to capture the words that occur most often across the dataset and so maximize the value of aggregate-level findings.
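The steps above can be sketched in a few lines of Python. This is a minimal illustration only, not Dedoose's actual code: the function names, the tokenizer, and the tiny stop word set are assumptions made for the example, and the real stop word list is much broader.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stop word set (the real list is much broader).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was", "on", "at"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def pooled_top_words(documents, top_n=40, stop_words=STOP_WORDS):
    """Pool each document's top-N word counts into a dataset-wide ranking."""
    totals = Counter()    # total count of each unique word across all documents
    per_document = []     # individual top-N word counts, one entry per document

    for text in documents:
        # Step 1: drop stop words, then count what remains in this document.
        counts = Counter(w for w in tokenize(text) if w not in stop_words)
        # Step 2: keep only this document's top-N words.
        doc_top = Counter(dict(counts.most_common(top_n)))
        per_document.append(doc_top)
        # Step 3: add those counts to the running totals.
        totals.update(doc_top)

    # Step 4: the final list is the top-N words of the pooled totals.
    return totals.most_common(top_n), per_document


docs = [
    "The teacher praised the student and the teacher smiled",
    "A student asked the teacher a question",
    "Student housing was expensive",
]
top_words, per_document = pooled_top_words(docs)
pooled = dict(top_words)
```

In this sketch, "student" and "teacher" each reach a pooled count of 3 because they recur across documents, while stop words such as "the" never enter the totals.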
Please note that if you apply a dataset filter to isolate a specific set of media files, Dedoose will not recalculate the top 40 words; instead, it will present the existing top 40 data for only those documents.
The Word Frequency chart works well for tracking very popular words, but it can miss "mid-frequency" words that appear consistently across many documents yet never make it into any individual document's top 40 list. Because only each document's top 40 words are pooled, such words never enter the running totals.
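A small example may make this limitation concrete. The sketch below is illustrative only and is not Dedoose's implementation: it scales the cutoff down from 40 to 2 so the effect is visible with short texts, skips stop word removal, and uses hypothetical function names and sample documents.

```python
from collections import Counter


def pooled_counts(documents, top_n):
    """Sum only each document's top-N word counts (the pooled approach)."""
    totals = Counter()
    for text in documents:
        counts = Counter(text.lower().split())
        totals.update(dict(counts.most_common(top_n)))
    return totals


def full_counts(documents):
    """Sum every word count in every document (no per-document cutoff)."""
    totals = Counter()
    for text in documents:
        totals.update(text.lower().split())
    return totals


# "budget" appears twice in every document but is never in a document's top 2.
docs = [
    "coffee coffee coffee coffee beans beans beans budget budget",
    "tea tea tea tea leaves leaves leaves budget budget",
    "cocoa cocoa cocoa cocoa milk milk milk budget budget",
]

pooled = pooled_counts(docs, top_n=2)
full = full_counts(docs)

print("budget" in pooled)    # False: never pooled
print(full["budget"])        # 6: the highest raw count in the dataset
```

Here "budget" has the highest raw count across the three documents, yet it is absent from the pooled totals because it never ranks in any single document's top 2. The same effect applies at the top 40 cutoff for words that are steady but never dominant.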
The Stop Word list is attached to this guide and downloadable.