Forem Creators and Builders 🌱

Daniel Uber

Changelog: Better Support for Your Language in Tags

A previous PR added strict frontend and backend validation for tags that disallowed accented characters.

We're relaxing this restriction so that tags containing non-ASCII letters can be used.

For example, if you're publishing content in Spanish and want to make it more discoverable, you can tag the article with Español. This tag already existed and was in use (you can see its articles at https://dev.to/t/espa%C3%B1ol), but it could not be added to new articles, so all of its content is from May 2020 or earlier.

A Portuguese-speaking user recently reported issues importing articles from an external blog. This PR should also reduce errors when importing articles from RSS feeds whose categories are labelled with non-ASCII characters.

We still won't permit emoji or symbols in tag names, only words and letters. When adding this change, we tested against Polish, Chinese, and Arabic words to get a fair sample of different languages and scripts, and found no issues.
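
To illustrate the rule, here is a minimal Ruby sketch of a whole-string check that accepts letters and digits from any script and rejects symbols and emoji. The constant and method names are hypothetical, and this is not the code from the PR; it only assumes a Unicode-aware `[[:alnum:]]` character class.

```ruby
# Hypothetical validator, for illustration only (not Forem's actual code).
# \A and \z anchor the pattern so the *whole* name must be letters or digits.
TAG_NAME = /\A[[:alnum:]]+\z/

def valid_tag_name?(name)
  TAG_NAME.match?(name)
end

samples = [
  "Español", # Spanish
  "żółw",    # Polish
  "中文",     # Chinese
  "عربي",    # Arabic
  "Test™",   # contains a symbol
  "🌱"       # emoji
]

samples.each do |name|
  puts "#{name}: #{valid_tag_name?(name)}"
end
```

Under these assumptions the four words are accepted, while the names containing a trademark sign or an emoji are rejected.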

Top comments (6)

Ella (she/her/elle)

Paging @9comindia and @yheuhtozr because you've both made some really great suggestions regarding i18n in the past.

We're making some small steps, and we hope you'll find them helpful.

yheuhtozr • Edited

Thank you for letting me know! This seems a great step towards a multilingual site. I'll definitely take time to test it this weekend, but at a quick glance, the current implementation ([[:alnum:]]) looks reasonably solid 👍, apart from some nitpick-level adjustments needed to cope with the real world:

unicode.org/reports/tr31/#Specific...

  • add U+00B7: multiple Iberian languages require it for some words
  • add U+05F3, U+05F4: Hebrew requires them for some words
  • add U+0F0B: no Tibetan multi-syllable word can be spelled without this
  • add U+200C, U+200D: most Indian & some Arabic languages require them for some words
  • add U+30FB: Japanese requires it for some words
  • some languages need hyphens and apostrophes as a part of spelling, but I have no idea how much their speakers think it legible without those signs
On the other hand, despite OP saying "[w]e still won't permit emoji or symbols in tag names, only words and letters", I can come up with some emojis which escape it (it could be tricky, so leave it up to you whether to fix this hole).
/[[:alnum:]]+/ === "0️⃣" # true

Sorry that this test sample is invalid.

Daniel Uber • Edited

Thanks for testing this out, and the additional information about missing character support - I knew the "middle dot" is used in Catalan to separate some double consonants (like in "col·laborar"), and I wouldn't be surprised if there were other Iberian languages using it in a similar manner.

Initially I was trying to test support the same way you demonstrated and misled myself.

/[[:alnum:]]/ === "Test™"
=> true
/\A[[:alnum:]]+\z/ === "Test™"
=> false
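
A short sketch of why the first check misleads: without anchors, the pattern succeeds as soon as it finds any run of letters or digits, and here that run is just "Test". In Ruby, `\A` and `\z` anchor to the start and end of the whole string, whereas `^` and `$` only anchor to lines.

```ruby
name = "Test™"

# Unanchored: the pattern only needs to match somewhere in the string.
matched = name[/[[:alnum:]]+/]
puts matched.inspect                    # => "Test"

# The unanchored test therefore passes even though "™" is present.
puts name.match?(/[[:alnum:]]+/)        # => true

# Anchored: every character must belong to the class, so "™" fails it.
puts name.match?(/\A[[:alnum:]]+\z/)    # => false
```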

We do currently exclude "col·laborar" as a tag name (I tested and verified that support for your recommended joiner/modifier characters is lacking). That doesn't look either necessary or intentional, so I'll move it into a separate issue to add that support.

I would generally prefer to err on the side of safely accepting too much rather than being more restrictive than necessary. I'm not sure whether it's important to restrict the keycaps or variant modifiers you showed in the "key 0" example.

yheuhtozr • Edited

Thank you for the extensive test. At least AFAIK, it'll be sufficient if middle dots can come anywhere other than word-initial for European languages.

Unfortunately I think your test cases do not check the whole string (they return true as soon as they pick up any alnum inside):
/\A[[:alnum:]]+\z/ === "This\u200dis a string col·laborar" # false
/\A[[:alnum:]]+\z/ === "0\u200d\u200c\u0f0b\u30fb\u00b7·\u05f3" # false

And sorry for a late edit in my previous comment:

some languages need hyphens and apostrophes as a part of spelling, but I have no idea how much their speakers think it legible without those signs

This includes Irish and Hokkien. What do you think about it?

Daniel Uber

Yes - thanks for keeping me honest (my initial tests were invalid - I've since edited the reply).

I opened an issue (github.com/forem/forem/issues/14745) to clarify the changes expected. Unless we encounter technical reasons to prohibit those characters, there shouldn't be a reason not to extend the valid set of characters.

I think enforcing the "medial joins must be between characters" rule strictly is probably harder to get right than it's worth. I imagine permitting "joiner" characters anywhere in an otherwise valid string will be permissive and slightly wrong, but it also removes the blocking validation from the :alnum: class.
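
As a rough sketch of that trade-off (not the implementation tracked in the issue), the permissive approach could add the code points suggested earlier in the thread to the character class. The constant names here are hypothetical. A slightly stricter variant that only forbids a joiner in first position is shown for comparison.

```ruby
# Code points taken from the list suggested earlier in this thread.
# All names below are hypothetical, for illustration only.
JOINERS = "\u00B7\u05F3\u05F4\u0F0B\u200C\u200D\u30FB"

# Permissive: joiners may appear anywhere, even first or on their own.
PERMISSIVE_TAG = /\A[[:alnum:]#{JOINERS}]+\z/

# Stricter: the first character must be a letter or digit.
NON_INITIAL_TAG = /\A[[:alnum:]][[:alnum:]#{JOINERS}]*\z/

["col\u00B7laborar", "\u00B7abc", "Test\u2122"].each do |name|
  puts "#{name}: permissive=#{PERMISSIVE_TAG.match?(name)} " \
       "non_initial=#{NON_INITIAL_TAG.match?(name)}"
end
```

Both patterns accept "col·laborar" and reject "Test™". Only the permissive one accepts "·abc", which is the "slightly wrong" case described above.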

Regarding hyphens - I feel like there's probably some overlap between spelling requirements (English also has words which should be spelled with hyphens, like brother-in-law or twenty-two) and an assumption that tags are "simple" and not composed of phrases. I acknowledge the validity of the language support requirement, but I think I'll need to discuss internally with our product team to determine how much that opens the possibility of "really-long-tags-made-of-full-sentences" in a way that's not desired. I don't have a personal opinion about that, but it might be a surprising or unwanted change to an existing and intentional restriction.

Ella (she/her/elle)

Thanks for this changelog, @djuber - and thanks for moving through the PR. I'm particularly enthused by this one!