A previous PR added strict frontend and backend validation for tags that disallowed accented characters.
We're relaxing this restriction so that tags containing non-ASCII characters can be used.
For example, if you're publishing content in Spanish and want to make it more discoverable, you can tag the article with Español. This tag already existed and was in use (you can see articles at https://dev.to/t/espa%C3%B1ol), but it could not be added to new articles, so all of its content is from May 2020 or earlier.
A Portuguese-speaking user recently reported issues importing articles from an external blog. This PR should also reduce errors when importing articles from RSS feeds whose categories are labelled with non-ASCII characters.
We still won't permit emoji or symbols in tag names, only words and letters. While making this change we tested against Polish, Chinese, and Arabic words to get a fair sample of different languages and scripts, and found no issues.
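As a rough illustration of the rule described above, here is a minimal Ruby sketch. The constant `TAG_NAME` and the helper `valid_tag_name?` are hypothetical names chosen for this example, not the identifiers used in the PR. The only assumption is that a tag name must consist entirely of characters from the `[[:alnum:]]` class.

```ruby
# Hypothetical pattern (an assumption for illustration, not the PR's actual code):
# the whole string must be one or more Unicode letters or digits.
TAG_NAME = /\A[[:alnum:]]+\z/

# Hypothetical helper: true when the name passes the pattern above.
def valid_tag_name?(name)
  TAG_NAME.match?(name)
end

# Letters from several scripts, plus a few names that should be rejected.
samples = ["español", "żółć", "中文", "العربية", "rails7", "🙂", "c++"]
samples.each do |name|
  puts "#{name}: #{valid_tag_name?(name)}"
end
```

In Ruby, `[[:alnum:]]` is Unicode-aware for UTF-8 strings, which is why accented Latin letters, Chinese characters, and Arabic letters pass while the emoji and the `+` symbol do not.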
Top comments (6)
Paging @9comindia and @yheuhtozr because you've both made some really great suggestions regarding i18n in the past.
We're making some small steps, and we hope you'll find them helpful.
Thank you for letting me know! This seems a great step towards a multilingual site. I'll definitely take time to test this weekend, but at a quick glance, the current implementation ([[:alnum:]]) looks reasonably solid 👍, except for needing some nitpick-level adjustments to cope with the real world:
On the other hand, despite OP saying "[w]e still won't permit emoji or symbols in tag names, only words and letters", I can come up with some emojis which escape it (it could be tricky, so I leave it up to you whether to fix this hole).
Sorry that this test sample is invalid.
Thanks for testing this out, and the additional information about missing character support - I knew the "middle dot" is used in Catalan to separate some double consonants (like in "col·laborar"), and I wouldn't be surprised if there were other Iberian languages using it in a similar manner.
Initially I was trying to test support the same way you demonstrated and misled myself.
We do currently exclude "col·laborar" as a tag name (I tested and verified that support for your recommended joiner/modifier characters is lacking). That doesn't look like it's either necessary or intentional, so I'll move it into a separate issue to add support.
I would generally prefer to err on the side of safely accepting too much rather than being more restrictive than necessary. I'm not sure whether it's important to restrict the keycaps or variant modifiers you showed in the "key 0" example.
Thank you for the extensive test. At least AFAIK, it'll be sufficient if middle dots can come anywhere other than word-initial for European languages.
Unfortunately I think your test cases do not cover the whole strings (they only return true by picking up any alnum inside):
And sorry for the late edit to my previous comment:
It includes Irish and Hokkien. What do you think about it?
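The partial-match pitfall mentioned above can be sketched in Ruby as follows. The constant and method names are hypothetical and the sample strings are chosen for illustration. They are not the test cases from the thread.

```ruby
# Unanchored: succeeds as soon as ANY single character is alphanumeric.
ANY_ALNUM = /[[:alnum:]]/

# Anchored with \A and \z: succeeds only if EVERY character is alphanumeric.
ALL_ALNUM = /\A[[:alnum:]]+\z/

# Hypothetical helpers wrapping the two patterns.
def partial_match?(text)
  ANY_ALNUM.match?(text)
end

def full_match?(text)
  ALL_ALNUM.match?(text)
end

["español", "🙂ok", "col\u00B7laborar"].each do |text|
  puts "#{text}: partial=#{partial_match?(text)} full=#{full_match?(text)}"
end
```

A test built on the unanchored pattern reports success for any string that merely contains a letter, so it cannot show that the other characters are accepted. Note that in Ruby `^` and `$` match at line boundaries, so `\A` and `\z` are the anchors to use when checking a whole string.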
Yes - thanks for keeping me honest (my initial tests were invalid - I've since edited the reply).
I opened an issue (github.com/forem/forem/issues/14745) to clarify the expected changes. Unless we encounter technical reasons to prohibit those characters, there shouldn't be a reason not to extend the valid set of characters.
I think enforcing the "medial joins must be between characters" rule is probably harder to do strictly than it's worth to get right. I imagine permitting "joiner" characters anywhere in an otherwise valid string will be permissive and slightly wrong, but it also removes the blocking validation from the :alnum: class.
Regarding hyphens - I feel like there's probably some overlap between spelling requirements (English also has words which should be spelled with hyphens, like brother-in-law or twenty-two), and an assumption that tags are "simple" and not composed of phrases. I acknowledge the validity of the language support requirement, but I think I'll need to discuss internally with our product team to determine how much that opens the possibility of "really-long-tags-made-of-full-sentences" in a way that's not desired. I don't have a personal opinion about that, but it might be a surprising or unwanted change to an existing and intentional restriction.
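To make the trade-off around joiner characters concrete, here is a hedged Ruby sketch comparing two possible rules. Both patterns and all names are assumptions for illustration and say nothing about what the follow-up issue will implement. Only the middle dot (U+00B7) is included, since it is the one joiner confirmed in this thread.

```ruby
MIDDLE_DOT = "\u00B7" # the joiner in Catalan "col·laborar"

# Permissive rule (assumed): the joiner may appear anywhere, even first or last.
JOINER_ANYWHERE = /\A[[:alnum:]#{MIDDLE_DOT}]+\z/

# Slightly stricter rule (assumed): the joiner may appear anywhere except
# as the first character.
JOINER_NOT_INITIAL = /\A[[:alnum:]][[:alnum:]#{MIDDLE_DOT}]*\z/

# Hypothetical helpers wrapping the two patterns.
def permissive_tag?(name)
  JOINER_ANYWHERE.match?(name)
end

def non_initial_joiner_tag?(name)
  JOINER_NOT_INITIAL.match?(name)
end

["col#{MIDDLE_DOT}laborar", "#{MIDDLE_DOT}abc", "abc#{MIDDLE_DOT}"].each do |name|
  puts "#{name}: permissive=#{permissive_tag?(name)} " \
       "non_initial=#{non_initial_joiner_tag?(name)}"
end
```

The permissive pattern is the "slightly wrong" option: it also accepts a name that starts with a middle dot. The second pattern rules out the word-initial case at the cost of a slightly longer expression, and neither one enforces that the joiner sits strictly between two letters.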
Thanks for this changelog, @djuber - and thanks for moving through the PR. I'm particularly enthused by this one!