Tutorial: Using C++ STL map to Analyze Sentence Frequency

This tutorial demonstrates how to use the C++ Standard Template Library (STL) map to read a text file (e.g., the complete works of Shakespeare) and determine which sentence occurs most frequently.


1. Program Overview

The program will:

  1. Read text from a file.
  2. Split the text into sentences.
  3. Use a map to count the frequency of each sentence.
  4. Identify and output the most frequently occurring sentence.

2. Concepts to Understand

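  1. map: std::map is an ordered associative container of key-value pairs. Keys are unique and kept sorted; the value stored under a key can be modified.
  2. File I/O: Reading text from files using std::ifstream.
  3. String Manipulation: Using std::string member functions such as find_first_of and substr to split text into sentences.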
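
As a minimal sketch of the map behavior described above, the two helpers below (tally and keysInOrder are hypothetical names used only for illustration) show that operator[] creates a missing key with a count of 0 and that iteration visits keys in ascending order.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: count how often each word appears in `words`.
std::map<std::string, int> tally(const std::vector<std::string>& words) {
    std::map<std::string, int> counts;
    for (const std::string& w : words) {
        ++counts[w];  // operator[] inserts a missing key with value 0 first
    }
    return counts;
}

// Hypothetical helper: return the keys in the order the map iterates them,
// which for std::map is ascending key order.
std::vector<std::string> keysInOrder(const std::map<std::string, int>& m) {
    std::vector<std::string> keys;
    for (const auto& entry : m) {
        keys.push_back(entry.first);
    }
    return keys;
}
```

Calling tally on the words "b", "a", "b" gives a map with two keys, and keysInOrder lists "a" before "b" regardless of insertion order.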

3. Program Implementation

Step 1: Include Necessary Libraries

#include <iostream>
#include <fstream>
#include <string>
#include <map>
#include <vector>
#include <utility>
#include <iterator>
#include <algorithm>
#include <cctype>

Step 2: Define Helper Functions

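Function to Trim Whitespace

std::string trim(const std::string& str) {
    // Positions of the first and last characters that are not whitespace
    std::size_t first = str.find_first_not_of(" \t\n\r");
    std::size_t last = str.find_last_not_of(" \t\n\r");
    // npos means the string is empty or contains only whitespace
    return (first == std::string::npos || last == std::string::npos) ? "" : str.substr(first, last - first + 1);
}

Function to Split Text into Sentences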

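Sentences are identified by their terminal punctuation marks (., ?, !). This is a simple heuristic: an abbreviation such as "Mr." also ends a sentence, and any text after the last punctuation mark is ignored.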

std::vector<std::string> splitIntoSentences(const std::string& text) {
    std::vector<std::string> sentences;
    std::size_t start = 0, end = 0;

    // Each ., ?, or ! closes a sentence; text after the last one is ignored
    while ((end = text.find_first_of(".?!", start)) != std::string::npos) {
        std::string sentence = text.substr(start, end - start + 1); // Include the punctuation
        sentence = trim(sentence); // Remove leading and trailing whitespace
        if (!sentence.empty()) {
            sentences.push_back(sentence);
        }
        start = end + 1; // Move past the punctuation
    }

    return sentences;
}
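
One limitation of splitting on punctuation alone is that plain-text files are often hard-wrapped, so the same sentence can contain a newline in one place and a space in another and would then be counted as two different keys. The sketch below is one possible way to reduce that effect; collapseWhitespace is a hypothetical helper, not part of the program above, and it could be applied to each sentence before it is stored.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: replace every run of whitespace (spaces, tabs,
// newlines) with a single space and drop leading/trailing whitespace.
std::string collapseWhitespace(const std::string& s) {
    std::string out;
    bool pendingSpace = false;
    for (unsigned char c : s) {
        if (std::isspace(c)) {
            // Remember the gap, but never start the result with a space
            pendingSpace = !out.empty();
        } else {
            if (pendingSpace) {
                out += ' ';
            }
            out += static_cast<char>(c);
            pendingSpace = false;
        }
    }
    return out;
}
```

With this helper, a sentence that is broken across two lines in the file and the same sentence written on one line map to the same string.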

Step 3: Count Sentence Frequencies

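Use a std::map in which the key is the sentence and the value is its frequency count. Because operator[] inserts a missing key with a count of 0, a single increment handles both new and repeated sentences.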

std::map<std::string, int> countSentenceFrequencies(const std::vector<std::string>& sentences) {
    std::map<std::string, int> sentenceMap;
    for (const std::string& sentence : sentences) {
        sentenceMap[sentence]++;
    }
    return sentenceMap;
}
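
If sorted keys are not needed, a hash-based container is a common alternative. The following is a sketch under that assumption, not the tutorial's method: countSentenceFrequenciesHashed is a hypothetical name, and it swaps std::map for std::unordered_map, which offers average constant-time lookups at the cost of an unspecified iteration order.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical variant of the counting step using a hash table.
std::unordered_map<std::string, int> countSentenceFrequenciesHashed(
        const std::vector<std::string>& sentences) {
    std::unordered_map<std::string, int> counts;
    counts.reserve(sentences.size());  // upper bound on the number of distinct keys
    for (const std::string& sentence : sentences) {
        ++counts[sentence];  // missing keys start at 0, as with std::map
    }
    return counts;
}
```

The counts are the same as with std::map; only the iteration order differs, which matters if ties between equally frequent sentences should be resolved predictably.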

Step 4: Find the Most Frequent Sentence

std::pair<std::string, int> findMostFrequentSentence(const std::map<std::string, int>& sentenceMap) {
    std::pair<std::string, int> mostFrequent("", 0);
    for (const auto& pair : sentenceMap) {
        // Strict > keeps the first of several tied sentences in the map's sorted key order
        if (pair.second > mostFrequent.second) {
            mostFrequent = pair;
        }
    }
    return mostFrequent;
}
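
The same search can also be written with std::max_element from <algorithm>. This is an alternative sketch rather than a replacement for the loop above; findMostFrequentWithMaxElement is a hypothetical name. Like the loop, it returns the first of several tied entries in key order, and it returns an empty sentence with a count of 0 for an empty map.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>

// Hypothetical variant of the search step based on std::max_element.
std::pair<std::string, int> findMostFrequentWithMaxElement(
        const std::map<std::string, int>& sentenceMap) {
    if (sentenceMap.empty()) {
        return std::make_pair(std::string(), 0);
    }
    // Compare entries by count only; max_element returns the first maximum
    auto it = std::max_element(
        sentenceMap.begin(), sentenceMap.end(),
        [](const std::pair<const std::string, int>& a,
           const std::pair<const std::string, int>& b) {
            return a.second < b.second;
        });
    return std::make_pair(it->first, it->second);
}
```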

Step 5: Main Program

int main() {
    std::ifstream file("complete_works_of_shakespeare.txt"); // Input file
    if (!file) {
        std::cerr << "Error: Could not open the file." << std::endl;
        return 1;
    }

    std::string text((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>()); // Read entire file
    file.close();

    // Split text into sentences
    std::vector<std::string> sentences = splitIntoSentences(text);

    // Count sentence frequencies
    std::map<std::string, int> sentenceMap = countSentenceFrequencies(sentences);

    // Find the most frequent sentence
    std::pair<std::string, int> mostFrequent = findMostFrequentSentence(sentenceMap);

    // Output results
    std::cout << "The most frequent sentence is:\n";
    std::cout << "\"" << mostFrequent.first << "\"" << "\n";
    std::cout << "It occurs " << mostFrequent.second << " times." << std::endl;

    return 0;
}

4. Example Input and Output

Input File: complete_works_of_shakespeare.txt

Contains the complete text of Shakespeare’s works.

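Output (illustrative; the actual sentence and count depend on the file contents and on how the text is split)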

The most frequent sentence is:
"To be, or not to be, that is the question."
It occurs 25 times.

5. How It Works

  1. Read Text: The program reads the entire file into a single string.
  2. Split Sentences: Sentences are extracted by finding punctuation marks (., ?, !).
  3. Count Frequencies: Each sentence is added to a map where the value is incremented for each occurrence.
  4. Find Most Frequent: The program iterates through the map to find the sentence with the highest frequency.

6. Extensions

  1. Case Insensitivity: Convert all sentences to lowercase before processing to make the analysis case-insensitive.
  2. Punctuation Removal: Strip unnecessary punctuation for a cleaner comparison.
  3. Handle Large Files: Use streaming or partial reads to process very large text files.

7. Summary

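  • std::map suits counting occurrences: operator[] creates a missing key with a count of 0, keys stay unique and sorted, and each lookup or insertion takes O(log n) time.
  • File I/O and string manipulation are the building blocks for processing text files.
  • The program combines std::map, std::vector, and std::string from the C++ Standard Library to carry out a complete text-analysis task.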

By applying these techniques, you can perform sophisticated text processing tasks. Experiment with the code and modify it to suit your needs!

The complete works of Shakespeare can be found here:

Steps to Download and Use:

  1. Visit the Project Gutenberg link above.
  2. Download the plain text version of the file.
  3. Save the file as complete_works_of_shakespeare.txt in the same directory as your program.
  4. Run your program to analyze the text.