This tutorial demonstrates how to use the C++ Standard Template Library (STL) map to read a text file (e.g., the complete works of Shakespeare) and determine which sentence occurs most frequently.
1. Program Overview
The program will:
- Read text from a file.
- Split the text into sentences.
- Use a map to count the frequency of each sentence.
- Identify and output the most frequently occurring sentence.
2. Concepts to Understand
- map: A container of key-value pairs in which keys are unique and values can be modified.
- File I/O: Reading text from files using ifstream.
- String Manipulation: Using string to split text into sentences.
3. Program Implementation
Step 1: Include Necessary Libraries
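#include <iostream>
#include <fstream>
#include <string>
#include <map>
#include <vector>    // std::vector, used to hold the sentences
#include <iterator>  // std::istreambuf_iterator, used to read the file
#include <utility>   // std::pair
#include <algorithm>
#include <cctype>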
Step 2: Define Helper Functions
Function to Trim Whitespace
// Return a copy of str without leading or trailing whitespace.
std::string trim(const std::string& str) {
    size_t first = str.find_first_not_of(" \t\n\r");
    if (first == std::string::npos) {
        return "";  // str is empty or contains only whitespace
    }
    size_t last = str.find_last_not_of(" \t\n\r");
    return str.substr(first, last - first + 1);
}
Function to Split Text into Sentences
Sentences will be identified by punctuation marks (., ?, !).
std::vector<std::string> splitIntoSentences(const std::string& text) {
    std::vector<std::string> sentences;
    size_t start = 0, end = 0;
    while ((end = text.find_first_of(".?!", start)) != std::string::npos) {
        std::string sentence = text.substr(start, end - start + 1); // Include the punctuation
        sentence = trim(sentence); // Remove leading and trailing whitespace
        if (!sentence.empty()) {
            sentences.push_back(sentence);
        }
        start = end + 1; // Move past the punctuation
    }
    return sentences; // Text after the last punctuation mark is not included
}
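One behavior of the splitter worth seeing in isolation: the loop only emits text that ends in a terminator, so a trailing fragment without ., ?, or ! is dropped. The following is a minimal sketch of a variant that keeps that tail, assuming this is the behavior you want. The name splitKeepingTail is hypothetical and not part of the program above; trimming is inlined so the sketch compiles on its own.

```cpp
#include <string>
#include <vector>

// Hypothetical variant of splitIntoSentences that also keeps a trailing
// fragment with no terminator. Whitespace trimming is inlined here so the
// sketch does not depend on the tutorial's trim() function.
std::vector<std::string> splitKeepingTail(const std::string& text) {
    const std::string whitespace = " \t\n\r";
    std::vector<std::string> sentences;

    // Trim a candidate piece and store it if anything is left.
    auto addTrimmed = [&](const std::string& piece) {
        const std::size_t first = piece.find_first_not_of(whitespace);
        if (first == std::string::npos) {
            return;  // empty or all whitespace
        }
        const std::size_t last = piece.find_last_not_of(whitespace);
        sentences.push_back(piece.substr(first, last - first + 1));
    };

    std::size_t start = 0;
    std::size_t end = 0;
    while ((end = text.find_first_of(".?!", start)) != std::string::npos) {
        addTrimmed(text.substr(start, end - start + 1));  // include the terminator
        start = end + 1;
    }
    if (start < text.size()) {
        addTrimmed(text.substr(start));  // unterminated tail
    }
    return sentences;
}
```

Calling splitKeepingTail("Hi there.  Who's that? The end") yields three entries, the last being the unterminated "The end". Like the original, this treats every terminator as a sentence boundary, so an ellipsis or an abbreviation such as "Mr." is split into separate pieces.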
Step 3: Count Sentence Frequencies
Use a map in which each key is a sentence and each value is the number of times that sentence occurs.
std::map<std::string, int> countSentenceFrequencies(const std::vector<std::string>& sentences) {
    std::map<std::string, int> sentenceMap;
    for (const std::string& sentence : sentences) {
        sentenceMap[sentence]++; // operator[] inserts a count of 0 for a new key
    }
    return sentenceMap;
}
Step 4: Find the Most Frequent Sentence
std::pair<std::string, int> findMostFrequentSentence(const std::map<std::string, int>& sentenceMap) {
    std::pair<std::string, int> mostFrequent("", 0);
    for (const auto& entry : sentenceMap) {
        if (entry.second > mostFrequent.second) {
            mostFrequent = entry; // On ties, the first sentence in key order is kept
        }
    }
    return mostFrequent;
}
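The same search can also be written with std::max_element from the algorithm header. This is a sketch of an alternative, not a replacement required by the program; the name mostFrequentByMaxElement is hypothetical. std::max_element returns the first of several equal maxima, so ties resolve the same way as in the loop above.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>

// Hypothetical alternative to findMostFrequentSentence built on
// std::max_element with a comparator that looks only at the counts.
std::pair<std::string, int> mostFrequentByMaxElement(
        const std::map<std::string, int>& counts) {
    if (counts.empty()) {
        // Mirror the loop-based version's result for an empty map.
        return std::pair<std::string, int>("", 0);
    }
    auto best = std::max_element(
        counts.begin(), counts.end(),
        [](const std::pair<const std::string, int>& a,
           const std::pair<const std::string, int>& b) {
            return a.second < b.second;  // order entries by count
        });
    return std::pair<std::string, int>(best->first, best->second);
}
```

Both versions scan every entry once, so the choice between them is a matter of style rather than performance.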
Step 5: Main Program
int main() {
    std::ifstream file("complete_works_of_shakespeare.txt"); // Input file
    if (!file) {
        std::cerr << "Error: Could not open the file." << std::endl;
        return 1;
    }
    std::string text((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>()); // Read entire file
    file.close();
    // Split text into sentences
    std::vector<std::string> sentences = splitIntoSentences(text);
    // Count sentence frequencies
    std::map<std::string, int> sentenceMap = countSentenceFrequencies(sentences);
    // Find the most frequent sentence
    std::pair<std::string, int> mostFrequent = findMostFrequentSentence(sentenceMap);
    // Output results
    std::cout << "The most frequent sentence is:\n";
    std::cout << "\"" << mostFrequent.first << "\"" << "\n";
    std::cout << "It occurs " << mostFrequent.second << " times." << std::endl;
    return 0;
}
4. Example Input and Output
Input File: complete_works_of_shakespeare.txt
Contains the complete text of Shakespeare’s works.
Output (illustrative format only; the sentence and count actually printed depend on the text file used)
The most frequent sentence is:
"To be, or not to be, that is the question."
It occurs 25 times.
5. How It Works
- Read Text: The program reads the entire file into a single string.
- Split Sentences: Sentences are extracted by finding punctuation marks (., ?, !).
- Count Frequencies: Each sentence is added to a map where the value is incremented for each occurrence.
- Find Most Frequent: The program iterates through the map to find the sentence with the highest frequency.
6. Extensions
- Case Insensitivity: Convert all sentences to lowercase before processing to make the analysis case-insensitive.
- Punctuation Removal: Strip unnecessary punctuation for a cleaner comparison.
- Handle Large Files: Use streaming or partial reads to process very large text files.
7. Summary
- map is well suited to counting occurrences: operator[] creates a zero-initialized entry the first time a key is seen, and each lookup takes logarithmic time.
- File I/O and string manipulation are crucial for processing text files.
- This program showcases the power of C++ STL for real-world text analysis.
By applying these techniques, you can perform sophisticated text processing tasks. Experiment with the code and modify it to suit your needs!
The complete works of Shakespeare can be found here:
- Project Gutenberg: Complete Works of Shakespeare
- MIT Shakespeare Archive: The Complete Works of William Shakespeare
Steps to Download and Use:
- Visit the Project Gutenberg link above.
- Download the plain text version of the file.
- Save the file as complete_works_of_shakespeare.txt in the same directory as your program.
- Run your program to analyze the text.
