Hyvor Blogs has a cool SEO analyzer. It runs multiple checks on a post to give users feedback on how to improve their posts. One of those checks is the word count. My first approach to getting the word count was:
```ts
const wordsCount = content.split(/\s+/).length;
```
All the languages I know (English, Sinhala, and a bit of French) use spaces as word separators. So, naturally, I believed that was enough. But it was not for languages like Chinese, Japanese, Korean, etc.
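To make the problem concrete (the Chinese sample below means roughly "the weather is nice today"):

```ts
const en = "the weather is nice today";
const zh = "今天天气很好"; // no spaces between words

en.split(/\s+/).length; // 5
zh.split(/\s+/).length; // 1: the whole sentence counts as a single "word"
```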
After releasing the first version of the SEO analyzer, the first feedback we got was "Hey! I love the new SEO analysis feature, but that does not work in Chinese". Hyvor Blogs is a multi-language blogging platform, so it is critical to support non-space-boundary languages as well.
Intl.Segmenter!
I tried multiple ways to come up with a good solution. I played around with regexes, Unicode ranges to detect characters in languages like Chinese, and many other things. Some worked, but not to an acceptable extent.
Finally, I found the Intl.Segmenter web API buried deep in the internet. None of the StackOverflow answers I consulted suggested this API. But it's all I needed!
Firefox
The only issue with this API is that Firefox does not support it (yet). There is a Bugzilla thread opened six years ago without much progress, so there is no ETA on when this will land in Firefox.
Get Words
```ts
function getWords(content: string, languageCode: string) {

    if (typeof Intl.Segmenter === "function") {

        const segmenter = new Intl.Segmenter(getLanguageCode(languageCode), { granularity: 'word' });
        const iterator = segmenter.segment(content)[Symbol.iterator]();

        const words: string[] = [];
        for (const segment of iterator) {
            if (segment.isWordLike)
                words.push(segment.segment);
        }
        return words;

    } else {
        return content.split(/\s+/).filter(word => word !== "");
    }

}

function getLanguageCode(languageCode: string) {

    try {
        // @ts-ignore
        const codes = Intl.getCanonicalLocales(languageCode);
        return codes[0];
    } catch (err) {
        return undefined;
    }

}
```
Explanation
- `typeof Intl.Segmenter === "function"` makes sure the API is supported. Otherwise, we fall back to `split`.
- `getLanguageCode()` makes sure the given language code is a valid one. Otherwise, it returns `undefined`.
- `segmenter.segment()` returns an iterable of segments that includes words, whitespace, punctuation, etc. We use `segment.isWordLike` to keep words only.
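For a quick sanity check, here is roughly what `getWords()` returns for a space-separated and a non-space-separated input (sample strings; the exact CJK boundaries depend on the engine's ICU data):

```ts
getWords("Hello world, this is a test", "en");
// ["Hello", "world", "this", "is", "a", "test"]

getWords("今天天气很好", "zh");
// something like ["今天", "天气", "很", "好"]
```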
Beware of this performance issue
I froze my Mac doing this on a large document (10,000 words):
```ts
const segments = [...segmenter.segment(content)];
```
This converts the iterator directly into an array. However, each segment object contains an `input` key that saves the original input.
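To see what that means, here is roughly what a single segment object looks like (shape per MDN; the sample string is just for illustration):

```ts
const [first] = new Intl.Segmenter("en", { granularity: "word" }).segment("Hello world");
console.log(first);
// { segment: "Hello", index: 0, input: "Hello world", isWordLike: true }
```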
So, let's say 10,000 words = 50,000 characters = 50KB of memory per `input`. Since every one of those 10,000 segments carries its own `input`, you will need around 500MB of RAM for this array alone.
Don't directly convert the iterator into an array if you are planning to work with large documents.
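If you do need to walk a large document, a safer pattern is to consume the segments lazily and keep only a running count (or plain strings), so the segment objects and their `input` references never pile up in memory. A minimal sketch (the helper name is mine, not part of the code above):

```ts
// Count word-like segments without materializing the segment objects into an array.
function countWordsLazily(content: string, locale?: string) {
    const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
    let count = 0;
    for (const segment of segmenter.segment(content)) {
        if (segment.isWordLike) count++;
    }
    return count;
}
```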
Get the Word Count
Now that we have a function to get the words, we can easily get the word count.
```ts
function getWordsCount(content: string, language: string) {
    return getWords(content, language).length;
}
```
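A couple of hypothetical calls (the exact CJK count depends on how the engine segments the text):

```ts
getWordsCount("Hyvor Blogs supports many languages", "en"); // 5
getWordsCount("今天天气很好", "zh"); // usually 3 or 4
```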
Bonus: Keyword Searching
Another part of the SEO analyzer is to check for keywords within content. Here's how I used Intl.Segmenter to do that:
```ts
function getKeywordOccurrences(content: string, keyword: string, language: string) {

    const segmenter = new Intl.Segmenter(
        getLanguageCode(language),
        { granularity: "word" }
    );

    let occurrences = 0;

    // it is safe to convert to an array directly,
    // given the keyword is short
    const keywordWords = [...segmenter.segment(keyword)]
        .map(segment => segment.segment);

    const contentSegmentsIterator = segmenter.segment(content)[Symbol.iterator]();
    const contentWords: string[] = [];
    for (const segment of contentSegmentsIterator) {
        contentWords.push(segment.segment);
    }

    contentWords.forEach((contentWord, i) => {

        if (
            contentWord.toLowerCase() !==
            keywordWords[0].toLowerCase()
        )
            return;

        let found = true;

        for (let j = 1; j < keywordWords.length; j++) {
            if (
                !contentWords[i + j] ||
                contentWords[i + j].toLowerCase() !== keywordWords[j].toLowerCase()
            ) {
                found = false;
                break;
            }
        }

        if (found)
            occurrences++;

    });

    return occurrences;

}
```
This counts the number of occurrences of a keyword in a string. It loops through each segment (word) in the content. If one matches the first segment of the keyword, it checks whether the following segments match the remaining segments of the keyword.
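For example, with a made-up content string and keyword, the function above should count two occurrences, since matching runs over raw segments (spaces included) and the keyword's own space lines up with the spaces in the content:

```ts
getKeywordOccurrences(
    "Static site generators are fast. A static site is also easy to host.",
    "static site",
    "en"
); // 2
```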
Conclusion
It's amazing that we get tools like this built right into the platform. Intl.Segmenter is powerful. I invite you to check the documentation on MDN to see other possibilities, such as sentence segmentation. If you have any feedback, feel free to comment below.