Forem Creators and Builders

Daniel Uber

Changelog: When is an article indexed by a search engine?

A new setting has been added to manage which articles can be indexed by search engines based on their score. New communities should see more posts indexed by search engines faster.

Background

We identified a few recent pain points among creators where articles in their communities were being ignored by Google and other search engines.

After further investigation, we identified some choices the Forem team had made to limit indexing of spam or low-quality posts.
In new communities, where a large body of existing activity (ratings, comments, and so on) may not exist, some of those limits might not make sense. One in particular, treating posts that contain code snippets as higher quality, was a good fit for DEV and other code-centered communities but is less meaningful for other Forems.

The Fix

Remove unneeded checks for code tags when filtering articles from search indexing #14801

What type of PR is this? (check all applicable)

  • [x] Refactor
  • [ ] Feature
  • [ ] Bug Fix
  • [ ] Optimization
  • [ ] Documentation Update

Description

Remove the check for <code> tags (a few test cases that exercise this can still be removed). Simplify the two "magic" numbers, 3 and 5, used in sitemap generation and on article show pages into a single value: if an article is going to be included in the sitemap, it should be indexed.

Related Tickets & Documents

Andy's forem.team post

Ben's suggestion from the comment was the guidance here.

QA Instructions, Screenshots, Recordings

Behavior covered by the spec/requests/stories_show_spec.rb cases.

Confirmed new setting shows and can be updated in the admin/customization page:

[Screenshot: admin view]

UI accessibility concerns?

None.

Added/updated tests?

  • [ ] Yes
  • [x] No, and this is why: refactor
  • [ ] I need help with writing tests

[Forem core team only] How will this change be communicated?

Will this PR introduce a change that impacts Forem members or creators, the development process, or any of our internal teams? If so, please note how you will share this change with the people who need to know about it.

  • [x] I will share this change internally with the appropriate teams
  • [x] I will add a Forem.dev changelog post
  • [x] Updated the admin guide

[optional] Are there any post deployment tasks we need to perform?

DEV and possibly other communities should have the created setting set to non-zero.


We will add a new setting "index minimum score" to the User Experience and Branding section of the customization/config admin area.

[Screenshot: user experience settings including index minimum score]

This setting sets a minimum score, which decides which posts appear in the sitemap. It also factors into whether a post's page includes the “noindex” and “nofollow” robots meta tags.

This setting works similarly to the tag minimum score and home feed minimum score settings, which filter what appears in feeds. Rather than controlling the displayed feed pages, it controls what search engine crawlers are told to index.

Previously the sitemap and the robots meta tags used two different thresholds (3 and 5), so some pages found their way into the sitemap while also including the noindex/nofollow meta tags. We decided to unify these into a single threshold to avoid sending conflicting signals. The sitemap suggests the crawler should visit the page, the noindex meta instructs the crawler to ignore this page, so it makes sense to keep that consistent.

Negatively scored (down-voted by a moderator) articles will not be indexed, regardless of the value of this setting.
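
To make the conflict concrete, here is a minimal Ruby sketch. It is not Forem's actual code: the method names are hypothetical, the old thresholds of 3 and 5 are the ones mentioned in this discussion, and the robots check is simplified to the score comparison alone (other conditions may apply in practice).

```ruby
# Hypothetical sketch; names are illustrative and not taken from Forem's code.
OLD_SITEMAP_MINIMUM = 3 # old sitemap threshold
OLD_ROBOTS_MINIMUM = 5  # old robots meta threshold

# Old behavior (simplified to score only): the two thresholds could disagree.
def old_signals(score)
  { in_sitemap: score >= OLD_SITEMAP_MINIMUM, noindex: score < OLD_ROBOTS_MINIMUM }
end

# New behavior (simplified): one threshold drives both signals, and a
# negatively scored article is never indexed, whatever the setting is.
def new_signals(score, index_minimum_score: 0)
  indexable = score >= 0 && score >= index_minimum_score
  { in_sitemap: indexable, noindex: !indexable }
end

p old_signals(4) # listed in the sitemap, yet marked noindex
p new_signals(4) # listed in the sitemap and left indexable
```

With a score of 4, the old rules put the article in the sitemap while also marking it noindex. Under a single threshold the two signals always agree.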

What does this mean for you

If you take no action after this is released, some articles which had been excluded from the sitemap (because they had a score lower than 3) or from the search engine results (because of the noindex meta) may start showing up.

Raising this setting above 0 in the settings page will require published posts to clear a higher bar before being indexed by search engine crawlers. This can cut down on spam submissions but won't serve as a complete alternative to moderation.

For posts with active discussions or many heart/unicorn votes, this will not be a problem. For new communities, where there's not a lot of active feedback or discussions, this change should speed up getting content read by Google and available in the search results.

Discussion (8)

Ildi • Edited

@djuber this is an interesting update. The other day I pointed out an indexing issue I was having to @michaeltharrington (also described below) and I think this new setting should help but I would like to understand a bit more on how it will work.

The sitemap suggests the crawler should visit the page, the noindex meta instructs the crawler to ignore this page, so it makes sense to keep that consistent.

Specifically, how does the new index minimum score setting determine which posts will appear as "noindex" vs allowing indexing? What is the difference between raising the value from 1 to 2 or 5? Also, does setting the value to 0 mean that all posts should be indexable? I noticed in the PR that @andy updated dev.to's value to 3, but I am wondering what this value means exactly and how posts get assigned that number. Is this related to the "Experience Level of Post" mods can assign to posts?

Also through my experience with 1VIBE so far all the posts I have published have been picked up by Google but for some reason this particular post shows as 'noindex' detected in 'robots' meta tag. I am checking which pages Google can index through the url inspection feature via the Google Search Console. Not sure how the Jay-Z post was automatically assigned a noindex value and hundreds of other posts were not.

Lee

Also, does setting the value to 0 mean that all posts should be indexable?

I am hoping this is the case. I didn't know that some of the content wouldn't be crawled unless it had a 3+ score.

Ben has also recently committed an interesting change to sitemaps that you might want to take a look at @ildi

Add recent resource sitemap endpoint for posts #14857

What type of PR is this? (check all applicable)

  • [ ] Refactor
  • [x] Feature
  • [ ] Bug Fix
  • [ ] Optimization
  • [ ] Documentation Update

Description

This PR adds an endpoint to our sitemap route called /sitemap-posts.xml, with an optional number called /sitemap-posts-2.xml etc.

Sitemaps need to be at the root of the domain, which is why the logic like this exists in the first place.

Currently we only have "monthly" sitemaps... /sitemap-Sep-2021.xml, which forces an admin to go into Google Search Console at the beginning of every month to add a new sitemap.

Monthly sitemaps allow an admin to provide encapsulated sitemaps and glean information about what Google is searching when, but they are not an ideal default sitemap to provide. The new endpoint is the "default" sitemap for posts on Forem.

There is optionally a zero-indexed value to append here if a Forem has more than 10k published posts, but generally a Forem can get by for a while with just submitting /sitemap-posts to Google Search Console.

This naming convention opens us up to adding /sitemap-pages and /sitemap-users, /sitemap-organizations, etc.... But I thought starting here was good, as those would require additional functionality.

This simplifies SEO and allows us to give clear instructions to creators:

  1. Submit /sitemap-posts to Google Search Console to help your crawl rate. This will include the most recent 10k indexable posts on the platform.
  2. Optionally submit monthly sitemaps if you want to see a breakdown of crawl based on month.
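
The naming scheme described above could be sketched roughly as follows. This is an illustration, not the PR's implementation: the method name is hypothetical, and it assumes 10,000 posts per sitemap with the bare path acting as page 0 and later pages getting a numeric suffix.

```ruby
# Hypothetical sketch; assumes 10,000 posts per sitemap and that the
# unnumbered path is page 0, with later pages numbered 1, 2, ...
POSTS_PER_SITEMAP = 10_000

# Lists the post sitemap paths needed to cover the given number of posts.
def post_sitemap_paths(indexable_post_count)
  pages = [(indexable_post_count.to_f / POSTS_PER_SITEMAP).ceil, 1].max
  (0...pages).map do |page|
    page.zero? ? "/sitemap-posts.xml" : "/sitemap-posts-#{page}.xml"
  end
end

puts post_sitemap_paths(25_000)
```

Under these assumptions, a Forem with fewer than 10k indexable posts only ever needs the single unnumbered path.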

I have chosen to use "posts" as opposed to "articles", which I think of as more of a "private term" that we should avoid if we can help it.

Related Tickets & Documents

There are a couple forem.dev posts about sitemap confusion and this should be a step in helping us provide a good answer.

forem.dev/lee/sitemaps-submission-... forem.dev/akhil/unable-to-submit-s...

Added/updated tests?

  • [x] Yes

[Forem core team only] How will this change be communicated?

Will this PR introduce a change that impacts Forem members or creators, the development process, or any of our internal teams? If so, please note how you will share this change with the people who need to know about it.

  • [ ] I've updated the Developer Docs or Storybook (for Crayons components)
  • [x] This PR changes the Forem platform and our documentation needs to be updated. I have filled out the Changes Requested issue template so Community Success can help update the Admin Docs appropriately.
  • [ ] I've updated the README or added inline documentation
  • [ ] I've added an entry to CHANGELOG.md
  • [ ] I will share this change in a Changelog or in a forem.dev post
  • [ ] I will share this change internally with the appropriate teams
  • [ ] I'm not sure how best to communicate this change and need help
  • [ ] This change does not need to be communicated, and this is why not: please replace this line with details on why this change doesn't need to be shared
Ildi

Thank you for pointing this out! I wasn’t aware that Forem communities have sitemaps.

I’m still hoping to understand why this Jay-Z article on 1VIBE has a “noindex” meta tag and how that was automatically assigned to the article.

Daniel Uber Author

Hi Ildi,

A few checks factor into this decision. The change described here removed one that favored code tags, since it was DEV specific, unified the two magic numbers (previously 3 for the sitemap and 5 for the robots meta), and made the resulting value tunable in admin.

If an article has a negative score (more spam reactions than votes) it will not be indexed. If an article is a draft (not published) it's not indexable. If an article was published before July 13th, 2017 (unlikely for you, this shows as the featured_number in the code and seems DEV specific) it is not indexable.

The last condition requires all of the following to be true and is an anti-spam filter:

  • score less than the minimum (this had been a constant 5 before, the default now is 0), the idea being low quality posts would be excluded
  • no comments from this user ("active" community members participating in discussions didn't fit the spam pattern)
  • post is not featured (any featured post should be indexable and not include that robots meta)

When a new article is created - the score is lower than the previous "magic" number 5, but should be higher than the current default 0 (if you haven't set this in admin).
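
The checks described above can be sketched in Ruby. This is only an illustration of the logic as explained in this comment, not Forem's actual code: the struct fields, the method name, and the constant name are assumptions.

```ruby
# Hypothetical sketch of the checks described above; field and method names
# are assumptions, not Forem's actual schema.
require "date"

Article = Struct.new(:score, :published, :published_at, :featured,
                     :author_comment_count, keyword_init: true)

# Cutoff date mentioned above; described as DEV specific.
LEGACY_CUTOFF = Date.new(2017, 7, 13)

# Returns true when the page should carry the noindex robots meta tag.
def noindex?(article, index_minimum_score: 0)
  return true if article.score.negative?              # more spam reactions than votes
  return true unless article.published                # drafts are not indexable
  return true if article.published_at < LEGACY_CUTOFF # very old posts

  # Anti-spam filter: all three conditions must hold.
  article.score < index_minimum_score &&
    article.author_comment_count.zero? &&
    !article.featured
end

# A freshly published post by an author with no comments yet.
fresh = Article.new(score: 1, published: true, published_at: Date.new(2021, 9, 11),
                    featured: false, author_comment_count: 0)

p noindex?(fresh, index_minimum_score: 5) # old constant of 5
p noindex?(fresh)                         # new default of 0
```

In this sketch the same fresh post is marked noindex under the old constant of 5 and is left indexable at the new default of 0.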

I checked the two top articles for Jay-Z on 1vibe (1vibe.com/ildi/the-blueprint-chang... and 1vibe.com/1vibeteam/why-jay-z-s-co...) and didn't see the robots meta there now - the more recent one was published Sept 11th (before the changes were made) so it's possible at the time it was marked noindex and no longer is.

I am wondering what this value means exactly and how posts get assigned that number?

The article score is the sum of user reactions (hearts and unicorns from the left sidebar), with each counting 1 point (and an automatic like reaction created when you author the post). The author's reactions' scores (reactions on the user, rather than the article) are also factored in; this is normally used by moderators to downvote an abusive or spammy user. The actual code setting the score is on GitHub, but effectively the scoring is "more hearts means a higher score".
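
As a rough illustration of that scoring, here is a hedged Ruby sketch. It assumes the score is a plain sum of reaction points on the article plus reaction points on the author; the struct, the method name, the category labels, and the moderator downvote value of -10 are all hypothetical and are not Forem's real weights.

```ruby
# Hypothetical sketch; point values and names are assumptions for illustration.
Reaction = Struct.new(:category, :points, keyword_init: true)

# Sums reactions on the article together with reactions on its author.
def article_score(article_reactions, author_reactions)
  (article_reactions + author_reactions).sum(&:points)
end

# A new post starts with the author's automatic like.
own_like = Reaction.new(category: "like", points: 1)
unicorn = Reaction.new(category: "unicorn", points: 1)
# A moderator downvote on the author (the value -10 is made up).
mod_downvote = Reaction.new(category: "moderator_downvote", points: -10)

p article_score([own_like], [])
p article_score([own_like, unicorn], [])
p article_score([own_like, unicorn], [mod_downvote])
```

In this sketch a brand-new post scores above the default minimum of 0, while a moderator downvote on the author pushes the score negative.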

The old way could have marked an article no-index if the article had too few heart reactions, was not featured, and contained no code tag. The new way should remove that minimum popularity threshold at the default value of zero, which is reached for all published articles automatically.

Ildi

Hey Daniel,

Thank you for breaking this down for me. It’s very interesting to understand the initial logic for when the “noindex” rule only applied to DEV vs all the Forem communities today.

The Jay-Z deepfake article is indeed now indexable by Google. Based on your explanation, the cause of the “noindex” was most likely due to the author of that article having had 0 comments on 1VIBE when posting the article, even though the article was published on behalf of an organization, the 1VIBE Team in this case.

It makes sense that these rules were put into place considering the size of DEV and how many posts users must be making on the site every 24hrs. I’m glad that these settings can now be customized, they will be useful for SEO and dealing with spam.

Lee

Great explanation, thanks for this. Does this mean that articles on new Forems before the update would have noindex applied if the score was low?

Daniel Uber Author

@lee yes, that issue with articles published but not search indexed had been observed and reported in a few cases (including Ildi's case above). The decision to include or exclude those meta tags is made "live" when you view the page or when you generate the sitemap, so sites that have updated in the past week should already be seeing the change, with more articles getting picked up by search engines. If Forem hosts your site, these updates happen about twice per week; if you're self-hosted, you'd want to update to a current container image to get this change.

Lee

and here

Explain new simpler process for submitting sitemap to Google Search Console #17

Explain the changes

Admins can now be instructed to submit /sitemap-posts.xml as the default way to submit to Google Search Console if this is merged:

github.com/forem/forem/pull/14857

They can optionally be instructed to submit monthly sitemaps if they want to get specific breakdowns month-over-month, but this is more advanced functionality for very large Forems. This functionality has already been in place, but we did not have the more basic functionality in place.