JavaScript: How to count words in a multi-language safe way?

Hyvor Blogs has a cool SEO analyzer. It runs multiple checks on a post to give users feedback on how to improve it. One of those checks is the word count. My first approach to get the word count was:

const wordsCount = content.split(/\s+/).length;

All the languages I know (English, Sinhala, and a bit of French) use spaces as a word separator. So, naturally, I believed that was enough. But it was not enough for languages like Chinese and Japanese, which do not put spaces between words.
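To make the problem concrete, here is a small sketch of the whitespace approach applied to an English and a Chinese sentence. The helper name `naiveWordCount` and both sample strings are made up for this illustration, and the helper also drops empty strings, which the one-liner above does not.

```typescript
// Naive whitespace-based counter, the same idea as the one-liner above.
// Empty strings (from leading/trailing whitespace) are filtered out.
function naiveWordCount(text: string): number {
    return text.split(/\s+/).filter(part => part !== "").length;
}

const englishSample = "I love writing blog posts";
const chineseSample = "我喜欢写博客文章"; // roughly "I like writing blog posts"

console.log(naiveWordCount(englishSample)); // 5
console.log(naiveWordCount(chineseSample)); // 1
```

The Chinese sentence contains no whitespace at all, so the whole sentence is counted as a single "word".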

After releasing the first version of the SEO analyzer, the first feedback we got was “Hey! I love the new SEO analysis feature, but that does not work in Chinese”. Hyvor Blogs is a multi-language blogging platform, so it is critical to support non-space-boundary languages as well.

Intl.Segmenter!

I tried multiple ways to come up with a good solution. I played around with regex, Unicode ranges to detect characters in languages like Chinese, and many other things. Some worked but not to an acceptable extent.

Finally, I found the Intl.Segmenter web API buried inside the internet. None of the StackOverflow answers I consulted suggested this API. But it’s all I needed!

Firefox 😔

The only issue with this API is that, at the time of writing, Firefox does not support it yet. There is a Bugzilla thread opened 6 years ago, without much progress. So, there is no ETA yet on when this will be available on Firefox.

Get Words

function getWords(content: string, languageCode: string) {

    if (typeof Intl.Segmenter === "function") {

        // Intl.Segmenter is available: split using the language's own word rules
        const segmenter = new Intl.Segmenter(getLanguageCode(languageCode), { granularity: 'word' });

        const words: string[] = [];
        // iterate lazily instead of spreading into an array (see the performance note below)
        for (const segment of segmenter.segment(content)) {
            // skip whitespace, punctuation, and other non-word segments
            if (segment.isWordLike)
                words.push(segment.segment);
        }
        return words;

    } else {
        // fallback for browsers without Intl.Segmenter: split on whitespace
        return content.split(/\s+/).filter(word => word !== "");
    }

}

function getLanguageCode(languageCode: string) {

    try {
        // @ts-ignore
        const codes = Intl.getCanonicalLocales(languageCode);
        return codes[0];
    } catch (err) {
        // invalid language tag: let Intl.Segmenter use the default locale
        return undefined;
    }

}

Explanation

typeof Intl.Segmenter === "function" makes sure the API is supported. Otherwise, we use split as a fallback.

getLanguageCode() makes sure the given language code is a valid one. Otherwise, it returns undefined.
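To illustrate the validation step, here is a minimal sketch of how `Intl.getCanonicalLocales` treats a few inputs. The helper name `toCanonicalLocale` is made up for this example; it mirrors `getLanguageCode()` above.

```typescript
// Hypothetical helper mirroring getLanguageCode() from the post.
function toCanonicalLocale(tag: string): string | undefined {
    try {
        // Returns an array of canonicalized tags; we only passed one tag in.
        return Intl.getCanonicalLocales(tag)[0];
    } catch {
        // A RangeError is thrown for structurally invalid tags.
        return undefined;
    }
}

console.log(toCanonicalLocale("EN-us")); // "en-US" (casing is normalized)
console.log(toCanonicalLocale("zh_CN")); // undefined (underscores are not valid in a language tag)
```

Passing `undefined` as the locale to `Intl.Segmenter` makes it fall back to the runtime's default locale, which is why returning `undefined` is a reasonable failure value here.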

segmenter.segment returns an iterable Segments object. Its segments cover the whole input: words, whitespace, punctuation, etc. We use segment.isWordLike to keep words only.
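To see what the segments look like, here is a short sketch that prints each segment of a small English string. It assumes a runtime that ships Intl.Segmenter (for example a recent Chrome or Node.js); `demoSegmenter` and `wordLikeSegments` are names made up for this illustration.

```typescript
// Assumes Intl.Segmenter is available in the runtime.
const demoSegmenter = new Intl.Segmenter("en", { granularity: "word" });

// Hypothetical helper: logs every segment and returns only the word-like ones.
function wordLikeSegments(text: string): string[] {
    const words: string[] = [];
    for (const part of demoSegmenter.segment(text)) {
        // part.segment is the substring, part.isWordLike tells words from the rest
        console.log(JSON.stringify(part.segment), part.isWordLike);
        if (part.isWordLike) words.push(part.segment);
    }
    return words;
}

console.log(wordLikeSegments("Hello, world!")); // [ 'Hello', 'world' ]
```

The comma, the space, and the exclamation mark each show up as their own segment with isWordLike set to false, which is why the filter is needed.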

Beware of this performance issue

I froze my Mac doing this on a large document (10,000 words):

const segments = [...segmenter.segment(content)];

This converts the iterator directly into an array. However, each segment object has an input property that holds the original input.

So, let’s say

  • 10,000 words

  • ≈ 50,000 characters

  • ≈ 50 KB of memory per copy of the input

  • so, with one copy of the input per segment, you will need about 10,000 × 50 KB = 500 MB of RAM for this array only

Don’t directly convert the iterator into an array if you are planning to work with large documents.
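One way to stay clear of this is to consume the segments one by one and keep only what you need. Here is a minimal sketch, assuming only a count is required and Intl.Segmenter is available; `countWordsStreaming` is a name made up for this example.

```typescript
// Hypothetical helper: counts word-like segments without storing segment objects.
function countWordsStreaming(text: string, locale?: string): number {
    const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
    let count = 0;
    // Each segment object is used once and can be garbage collected right away.
    for (const part of segmenter.segment(text)) {
        if (part.isWordLike) count++;
    }
    return count;
}

console.log(countWordsStreaming("The quick brown fox", "en")); // 4

// For Chinese, the exact count depends on the runtime's ICU dictionary data.
console.log(countWordsStreaming("我喜欢写博客文章", "zh"));
```

Since nothing but a number is kept, memory use no longer grows with the number of segments.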

Get Word Count

Now that we have a function to get the words, we can easily get the word count.

function getWordsCount(content: string, language: string) {
    return getWords(content, language).length;
}

Bonus: Keyword Searching

Another part of the SEO analyzer is to check for keywords within content. Here’s how I used Intl.Segmenter to do that:

function getKeywordOccurrences(content: string, keyword: string, language: string) {

    const segmenter = new Intl.Segmenter(
        getLanguageCode(language),
        {granularity: "word"}
    );

    let occurrences = 0;

    // it is safe to convert to an array directly,
    // given the keyword is short
    const keywordWords = [...segmenter.segment(keyword)]
        .map(segment => segment.segment);

    // an empty keyword has no segments, so there is nothing to match
    if (keywordWords.length === 0)
        return 0;

    // iterate instead of spreading, so only the strings are kept
    const contentWords: string[] = [];
    for (const segment of segmenter.segment(content)) {
        contentWords.push(segment.segment);
    }

    contentWords.forEach((contentWord, i) => {

        // quick check: does this segment match the first keyword segment?
        if (
            contentWord.toLowerCase() !==
            keywordWords[0].toLowerCase()
        )
            return;

        let found = true;

        // compare the following segments with the rest of the keyword
        for (let j = 1; j < keywordWords.length; j++) {
            if (
                !contentWords[i + j] ||
                contentWords[i + j].toLowerCase() !== keywordWords[j].toLowerCase()
            ) {
                found = false;
                break;
            }
        }

        if (found)
            occurrences++;

    });

    return occurrences;

}

This counts the number of occurrences of a keyword in a string. It loops through each segment in content. If a segment matches the first segment of the keyword, it checks whether the following segments match the remaining segments of the keyword.
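Note that the function above also compares whitespace and punctuation segments, so a keyword typed with a single space will not match content that has two spaces between the same words. If that matters to you, one possible variation is to compare word-like segments only. This is a sketch under that assumption, not what the SEO analyzer does; `wordLikeLower` and `countKeywordLoose` are names made up for this example.

```typescript
// Hypothetical helper: lowercased word-like segments only.
function wordLikeLower(segmenter: Intl.Segmenter, text: string): string[] {
    const words: string[] = [];
    for (const part of segmenter.segment(text)) {
        if (part.isWordLike) words.push(part.segment.toLowerCase());
    }
    return words;
}

// Hypothetical variation: counts keyword occurrences, ignoring whitespace and punctuation.
function countKeywordLoose(content: string, keyword: string, locale?: string): number {
    const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
    const keywordWords = wordLikeLower(segmenter, keyword);
    if (keywordWords.length === 0) return 0; // nothing to search for

    const contentWords = wordLikeLower(segmenter, content);
    let occurrences = 0;
    // Slide a window of keywordWords.length over the content words.
    for (let i = 0; i + keywordWords.length <= contentWords.length; i++) {
        if (keywordWords.every((word, j) => contentWords[i + j] === word)) {
            occurrences++;
        }
    }
    return occurrences;
}

const sampleContent = "SEO analyzer tips: a good seo   analyzer helps";
console.log(countKeywordLoose(sampleContent, "seo analyzer", "en")); // 2
```

The trade-off is that punctuation between the words is ignored too, so "seo, analyzer" would also count as a match.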

Conclusion

It’s amazing that we have all the tools we need to build an amazing API. Intl.Segmenter is powerful. I invite you to check the documentation on MDN to see other possibilities, such as sentence segmenting. If you have any feedback, feel free to comment below.