A new setting has been added to manage which articles can be indexed by search engines based on their score. New communities should see more posts indexed by search engines faster.
Background
We identified a few recent pain points among creators where articles in their communities were being ignored by Google and other search engines.
After further investigation, we identified some choices the Forem team had made to limit indexing of spam or low quality posts.
In new communities, where a large body of existing activity, ratings, comments, etc. may not exist, some of the limits might not make sense. One in particular, prioritizing posts with snippets in them as higher quality, was a good fit for DEV and other code centered communities and less meaningful for other Forems.
The Fix
Remove unneeded checks for code tags when filtering articles from search indexing #14801
What type of PR is this? (check all applicable)
- [x] Refactor
- [ ] Feature
- [ ] Bug Fix
- [ ] Optimization
- [ ] Documentation Update
Description
Remove the check for <code> tags (a few test cases that exercise this can still be removed).
Unify the two "magic" numbers, 3 and 5, used in sitemap generation and on the article show page into a single value: if an article is going to be included in the sitemap, it should be indexed.
Related Tickets & Documents
Andy's forem.team post
Ben's suggestion from the comment was the guidance here.
QA Instructions, Screenshots, Recordings
Behavior covered by the spec/requests/stories_show_spec.rb cases.
Confirmed new setting shows and can be updated in the admin/customization page:
UI accessibility concerns?
None.
Added/updated tests?
- [ ] Yes
- [x] No, and this is why: refactor
- [ ] I need help with writing tests
[Forem core team only] How will this change be communicated?
Will this PR introduce a change that impacts Forem members or creators, the development process, or any of our internal teams? If so, please note how you will share this change with the people who need to know about it.
- [x] I will share this change internally with the appropriate teams
- [x] I will add a Forem.dev changelog post
- [x] Updated the admin guide
[optional] Are there any post deployment tasks we need to perform?
DEV and possibly other communities should set the new setting to a non-zero value.
We will add a new setting "index minimum score" to the User Experience and Branding section of the customization/config admin area.
This setting sets a minimum score. This score will decide which posts appear in the Sitemap. It will also determine whether a post can include the “noindex” and “nofollow” robots meta tags.
This setting works similarly to the tag minimum score and home feed minimum score settings, which filter what appears on the displayed feed pages. The difference is that this one controls what search engine crawlers are told to index.
Previously, two different thresholds let some pages into the sitemap while those same pages also carried the noindex/nofollow meta tags. We decided to unify these into a single threshold to avoid sending conflicting signals: the sitemap tells the crawler to visit the page, while the noindex meta tells the crawler to ignore it, so the two should agree.
Negatively scored (down-voted by a moderator) articles will not be indexed, regardless of the value of this setting.
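As a rough illustration of the behavior described above, here is a minimal sketch in Python (not Forem's actual Ruby code). The `Article` class, the function names, and the inclusive `>=` comparison are assumptions made for the example. The point is that one decision feeds both the sitemap and the robots meta tag, so the two cannot disagree.

```python
from dataclasses import dataclass


@dataclass
class Article:
    # Hypothetical stand-in for a post; field names are illustrative only.
    published: bool
    score: int


def is_indexable(article: Article, index_minimum_score: int = 0) -> bool:
    """Single decision shared by the sitemap and the robots meta tag."""
    if not article.published:
        return False
    if article.score < 0:
        # Negatively scored articles are never indexed, whatever the setting.
        return False
    # Assumption: the minimum is inclusive (score equal to the setting passes).
    return article.score >= index_minimum_score


def robots_meta(article: Article, index_minimum_score: int = 0) -> str:
    """Return the robots meta tag for the article page, or '' if indexable."""
    if is_indexable(article, index_minimum_score):
        return ""
    return '<meta name="robots" content="noindex, nofollow">'


def sitemap_entries(articles, index_minimum_score: int = 0):
    """Keep only the articles that would also be served without noindex."""
    return [a for a in articles if is_indexable(a, index_minimum_score)]
```

With the default of 0, a freshly published post with a score of 1 lands in the sitemap and carries no robots tag. Raising the setting to 3 would drop it from both at once.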
What does this mean for you
If you take no action after this is released, some articles which had been excluded from the sitemap (because they had a score lower than 3) or from the search engine results (because of the noindex meta) may start showing up.
Raising this setting above 0 in the settings page will require published posts to clear a higher bar before being indexed by search engine crawlers. This can cut down on spam submissions but won't serve as a complete alternative to moderation.
For posts with active discussions or many heart/unicorn votes, this will not be a problem. For new communities, where there's not a lot of active feedback or discussions, this change should speed up getting content read by Google and available in the search results.
Top comments (8)
@djuber this is an interesting update. The other day I pointed out an indexing issue I was having to @michaeltharrington (also described below) and I think this new setting should help but I would like to understand a bit more on how it will work.
Specifically, how does the new index minimum score setting determine which posts will appear as "noindex" vs allowing indexing? What is the difference between raising the value from 1 to 2 or 5? Also, does setting the value to 0 mean that all posts should be indexable? I noticed in the PR that @andy updated dev.to's value to 3, but I am wondering what this value means exactly and how posts get assigned that number? Is this related to the "Experience Level of Post" mods can assign to posts?
Also through my experience with 1VIBE so far all the posts I have published have been picked up by Google but for some reason this particular post shows as 'noindex' detected in 'robots' meta tag. I am checking which pages Google can index through the url inspection feature via the Google Search Console. Not sure how the Jay-Z post was automatically assigned a noindex value and hundreds of other posts were not.
I am hoping this is the case. I didn't know that some of the content wouldn't be crawled unless it had a 3+ score.
Ben has also recently committed an interesting change to sitemaps that you might want to take a look at @ildi
Add recent resource sitemap endpoint for posts #14857
What type of PR is this? (check all applicable)
Description
This PR adds an endpoint to our sitemap route called /sitemap-posts.xml, with an optional number called /sitemap-posts-2.xml etc.
Sitemaps need to be at the root of the domain, which is why the logic like this exists in the first place.
Currently we only have "monthly" sitemaps... /sitemap-Sep-2021.xml, which forces an admin to go into Google Search Console at the beginning of every month to add a new sitemap.
This allows an admin to provide encapsulated sitemaps and glean information about what Google is searching when, but is not an ideal default sitemap to provide. This is the "default" sitemap for posts on Forem.
There is optionally a zero-indexed value to append here if a Forem has more than 10k published posts, but generally a Forem can get by for a while with just submitting /sitemap-posts to Google Search Console.
This naming convention opens us up to adding /sitemap-pages and /sitemap-users, /sitemap-organizations, etc. But I thought starting here was good, as those would require additional functionality.
This simplifies SEO and allows us to give clear instructions to creators: submit /sitemap-posts to Google Search Console to help your crawl rate. This will include the most recent 10k indexable posts on the platform.
I have chosen to use "posts" as opposed to "articles" which I think of more of a "private term" if we can help it.
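To make the naming scheme concrete, here is a small Python sketch (not the Rails route itself) of how such a path could map to a window of posts. The function name, the regular expression, and the reading that a missing suffix means page 0 are assumptions for illustration. Only the 10k page size and the zero-indexed suffix come from the description above.

```python
import re
from typing import Optional, Tuple

PAGE_SIZE = 10_000  # "most recent 10k indexable posts" per sitemap

# Matches /sitemap-posts, /sitemap-posts.xml, /sitemap-posts-2.xml, ...
_POSTS_SITEMAP = re.compile(r"/sitemap-posts(?:-(\d+))?(?:\.xml)?")


def posts_sitemap_window(path: str) -> Optional[Tuple[int, int]]:
    """Return (offset, limit) for a posts sitemap path, or None if no match.

    Assumption: no numeric suffix is treated as page 0, and the suffix
    is a zero-indexed page number.
    """
    match = _POSTS_SITEMAP.fullmatch(path)
    if match is None:
        return None
    page = int(match.group(1) or 0)
    return page * PAGE_SIZE, PAGE_SIZE
```

Under these assumptions, `/sitemap-posts.xml` covers the first 10,000 indexable posts, `/sitemap-posts-2.xml` starts at offset 20,000, and a monthly path such as `/sitemap-Sep-2021.xml` is left to the existing monthly logic.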
Related Tickets & Documents
There are a couple forem.dev posts about sitemap confusion and this should be a step in helping us provide a good answer.
forem.dev/lee/sitemaps-submission-... forem.dev/akhil/unable-to-submit-s...
Added/updated tests?
[Forem core team only] How will this change be communicated?
Will this PR introduce a change that impacts Forem members or creators, the development process, or any of our internal teams? If so, please note how you will share this change with the people who need to know about it.
CHANGELOG.md
Thank you for pointing this out! I wasn’t aware that Forem communities have sitemaps.
I’m still hoping to understand why this Jay-Z article on 1VIBE has a “noindex” meta tag and how that was automatically assigned to the article.
Hi Ildi,
There are a few checks in place that factor into this decision. The change described here removed one that favored code tags as DEV specific, unified the two magic numbers which had been 3 and 5 for the sitemap and robots meta, and made the value tunable in admin.
If an article has a negative score (more spam reactions than votes) it will not be indexed. If an article is a draft (not published) it's not indexable. If an article was published before July 13th, 2017 (unlikely for you, this shows as the featured_number in the code and seems DEV specific) it is not indexable.
The last condition requires all of the following to be true and is an anti-spam filter:
When a new article is created - the score is lower than the previous "magic" number 5, but should be higher than the current default 0 (if you haven't set this in admin).
I checked the two top articles for Jay-Z on 1vibe (1vibe.com/ildi/the-blueprint-chang... and 1vibe.com/1vibeteam/why-jay-z-s-co...) and didn't see the robots meta there now - the more recent one was published Sept 11th (before the changes were made) so it's possible at the time it was marked noindex and no longer is.
The article score is the sum of user reactions (hearts and unicorns from the left sidebar), with each counting 1 point (and an automatic like reaction created when you author the post). The author's reactions' scores (reactions on the user, rather than the article) are also factored in - this is normally used by moderators to downvote an abusive or spammy user. The actual code setting the score is in github but effectively the scoring is "more hearts means a higher score".
The old way could have marked an article no-index if the article had too few heart reactions, was not featured, and contained no code tag. The new way should remove that minimum popularity threshold at the default value of zero, which is reached for all published articles automatically.
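The difference between the two rules can be sketched in Python as follows. This is a simplified illustration based only on the description in this comment, not the actual Forem code: the function and parameter names are made up, it assumes the article is already published, and it leaves out the draft and publication-date checks.

```python
def old_noindex(score: int, featured: bool, has_code_tag: bool,
                magic_number: int = 5) -> bool:
    """Older rule as described: all three conditions had to hold."""
    return score < magic_number and not featured and not has_code_tag


def new_noindex(score: int, index_minimum_score: int = 0) -> bool:
    """Newer rule: negative scores are out, otherwise compare to the setting."""
    return score < 0 or score < index_minimum_score


# A brand-new post: only the author's automatic like, not featured, no code.
fresh_score = 1
print(old_noindex(fresh_score, featured=False, has_code_tag=False))
print(new_noindex(fresh_score))
```

For a fresh post like this, the older rule returns `True` (noindex) and the newer rule at the default setting returns `False`, which matches the change in behavior discussed in this thread.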
Hey Daniel,
Thank you for breaking this down for me. It’s very interesting to understand the initial logic for when the “noindex” rule only applied to DEV vs all the Forem communities today.
The Jay-Z deepfake article is indeed now indexable by Google. Based on your explanation, the cause of the “noindex” was most likely due to the author of that article having had 0 comments on 1VIBE when posting the article, even though the article was published on behalf of an organization, the 1VIBE Team in this case.
It makes sense that these rules were put into place considering the size of DEV and how many posts users must be making on the site every 24hrs. I’m glad that these settings can now be customized, they will be useful for SEO and dealing with spam.
Great explanation, thanks for this. Does this mean that articles on a new Forem before the update would have noindex applied if the score was low?
@lee yes, that issue with articles published but not search indexed had been observed and reported in a few cases (including Ildi's case above). The decision to include or exclude those meta tags is made "live" when you view the page, or when you generate the sitemap, so sites that have updated in the past week should already be seeing the change, with more articles getting picked up by search engines. If Forem hosts your site, these updates happen about twice per week. If you're self-hosted, you'd want to update to a current container image to get this change.
and here
Explain new simpler process for submitting sitemap to Google Search Console #17
Explain the changes
Admins can now be instructed to submit /sitemap-posts.xml as the default way to submit to Google Search Console if this is merged: github.com/forem/forem/pull/14857
They can optionally be instructed to submit monthly sitemaps if they want to get specific breakdowns month-over-month, but this is more advanced functionality for very large Forems. That functionality was already in place, but the more basic option was not.