Schema.org Usage Dataset: What SEOs Can Actually Learn
Google and Schema.org published domain-level usage buckets for thousands of schema terms. Learn how SEOs should use the data without turning it into ranking fiction.
TL;DR: Google and Schema.org now publish public usage buckets for Schema.org terms seen across unique domains. Use the data as an adoption benchmark for structured data audits, not as a ranking factor list or rich result report.
The useful part is the shape of adoption: a tiny head of common schema terms and a huge tail of rarely used vocabulary. The risky part is forgetting that the unit is domains, not pages.
Google and Schema.org have finally given structured data teams a public adoption reference.
On June 4, 2026, Schema.org announced the Schema.org Usage Statistics Dataset, created with Google and published in the official Schema.org GitHub repository. The files are available as CSV, JSON, and summary JSON. The first public snapshot covers May 2026.
That matters because structured data work has always had a gap between documentation and adoption. We can check Google's rich result rules. We can validate markup. We can audit templates. We can compare competitors. What we did not have was an official, crawl-scale reference showing which Schema.org terms appear across domains.
Now we do.
But the first mistake would be reading this as ranking data.
The dataset does not tell you which schema types boost rankings. It does not tell you which pages get rich results. It does not count JSON-LD blocks. It does not count URLs.
It tells you which Schema.org terms were observed across unique domains in Google's public crawl, grouped into adoption ranges.
That is still useful. It is useful for schema audits, CMS defaults, ecommerce roadmaps, stakeholder education, and deciding when a niche term deserves custom implementation.
What Google And Schema.org Published
The official Schema.org announcement explains that the dataset reports high-level term usage across millions of domains. The companion documentation at schema.org/docs/usage_stats.html explains the main boundaries:
- The data comes from Google's public web crawling infrastructure.
- Frequencies are aggregated at the domain level.
- Exact counts are not published.
- Terms are grouped into buckets.
- JSON-LD, Microdata, and RDFa are merged.
- Sites blocked from Google's crawl are not included.
The May 2026 data file has three practical columns.
| Field | What it means | Why SEOs should care |
|---|---|---|
| Class | Whether the row is an Itemtype or Predicate. | Helps separate types from properties during audits. |
| Name | The Schema.org term. | Lets you map detected or proposed schema to the official term list. |
| Domain Bucket | The unique-domain range where the term was observed. | Gives a public adoption benchmark, not a page count. |
Natzir's Schema.org public usage explorer is a helpful way to browse the data. Use it for searching, filtering, and explaining the buckets to non-technical teams.
For final numeric claims, use the official CSV or JSON from the Schema.org repository.
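As a minimal sketch of working from the official file, the snippet below tallies rows by class and by bucket. It assumes the CSV headers match the three field names in the table above exactly, which should be checked against the real file. The inline sample is a hypothetical stand-in for the downloaded CSV, using buckets cited elsewhere in this article.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the official CSV. The header names are an
# assumption based on the field table above; verify them in the real file.
SAMPLE_CSV = """Class,Name,Domain Bucket
Itemtype,WebPage,10M+
Itemtype,Product,1M - 10M
Predicate,price,1M - 10M
Itemtype,Dataset,10K - 100K
"""

def tally(csv_text):
    """Count terms per class and per domain bucket."""
    by_class = Counter()
    by_bucket = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_class[row["Class"].strip()] += 1
        by_bucket[row["Domain Bucket"].strip()] += 1
    return by_class, by_bucket

by_class, by_bucket = tally(SAMPLE_CSV)
print(dict(by_class))
print(dict(by_bucket))
```

To run it against the real snapshot, read the downloaded file's text and pass it to `tally` in place of `SAMPLE_CSV`.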
The May 2026 Numbers
The May 2026 snapshot contains 5,545 Schema.org terms.
| Metric | Value |
|---|---|
| Total terms in May 2026 file | 5,545 |
| Types / Itemtypes | 958 |
| Properties / Predicates | 4,587 |
| Terms in 10M+ domain bucket | 43 |
| Terms in 1M - 10M domain bucket | 100 |
| Terms in 100K - 1M domain bucket | 158 |
| Terms in 10K - 100K domain bucket | 420 |
| Terms in 1K - 10K domain bucket | 560 |
| Terms in < 1K domain bucket | 4,264 |
| Share of all terms in < 1K bucket | 76.9% |
| Share of all terms below 10K domains | 87.0% |
| Share of all terms at 1M+ domains | 2.6% |
The shape is clear: a small head, a huge tail.
Only 43 terms sit in the 10M+ domain bucket. Only 143 terms sit at 1M+ domains. Most terms sit below 10K domains, and more than three quarters sit in the < 1K bucket.
That does not make the long tail worthless. It means the common web has a small set of reusable structured data patterns, while specialized vocabulary serves specialized content.
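The percentages in the table follow directly from the bucket counts. A short sketch of the arithmetic, using only the counts reported above (the helper name `share` is just for illustration):

```python
# Bucket counts as reported for the May 2026 snapshot in the table above.
BUCKET_COUNTS = {
    "10M+": 43,
    "1M - 10M": 100,
    "100K - 1M": 158,
    "10K - 100K": 420,
    "1K - 10K": 560,
    "< 1K": 4264,
}

total = sum(BUCKET_COUNTS.values())

def share(*buckets):
    """Percentage of all terms that fall in the given buckets, to one decimal."""
    return round(100 * sum(BUCKET_COUNTS[b] for b in buckets) / total, 1)

print(total)                        # 5545
print(share("< 1K"))                # 76.9
print(share("1K - 10K", "< 1K"))    # 87.0
print(share("10M+", "1M - 10M"))    # 2.6
```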
Read The Unit Correctly: Domains, Not Pages
The most important caveat is the unit of measurement.
This is unique-domain data.
If a retail site uses Product schema on 500,000 URLs, it contributes one domain to the Product bucket. If a publisher marks up every article with Article, that domain still contributes one domain to the Article bucket.
That means the dataset answers this question:
Across how many unique domains did Google's public crawl observe this Schema.org term, grouped into ranges?
It does not answer these questions:
- How many pages use the term?
- How many markup objects exist?
- How many rich results were triggered?
- How many impressions or clicks came from pages with the term?
- Which syntax was used?
- Which implementations were valid?
This boundary matters in client work. SEO teams usually audit URLs, templates, and page types. This dataset measures adoption breadth by domain.
Use it as a benchmark. Do not treat it as a crawl report for your site.
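To make the unit concrete, here is a small sketch that contrasts page counts with unique-domain counts. The URLs are hypothetical, and the hostname is used as a simple stand-in for "domain"; how the dataset itself defines a domain is not covered here.

```python
from urllib.parse import urlsplit

# Hypothetical crawl observations: (URL, schema term seen on that URL).
observations = [
    ("https://shop.example/p/1", "Product"),
    ("https://shop.example/p/2", "Product"),
    ("https://shop.example/p/3", "Product"),
    ("https://news.example/a/1", "Article"),
    ("https://blog.example/post", "Product"),
]

def page_count(rows, term):
    """Number of URLs where the term was seen."""
    return sum(1 for _, seen in rows if seen == term)

def unique_domains(rows, term):
    """Set of hostnames where the term was seen at least once."""
    return {urlsplit(url).hostname for url, seen in rows if seen == term}

print(page_count(observations, "Product"))           # 4
print(len(unique_domains(observations, "Product")))  # 2
```

Four Product pages collapse to two domains. The usage dataset reports the second kind of number, and only as a range.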
Buckets Are Not Exact Counts
The dataset groups terms into ranges:
| Bucket | Meaning |
|---|---|
| 10M+ | Seen on more than ten million domains. |
| 1M - 10M | Seen on one million to ten million domains. |
| 100K - 1M | Seen on one hundred thousand to one million domains. |
| 10K - 100K | Seen on ten thousand to one hundred thousand domains. |
| 1K - 10K | Seen on one thousand to ten thousand domains. |
| < 1K | Seen on fewer than one thousand domains. |
That is precise enough for prioritization. It is not precise enough for claims like "Google found 8,221,409 Product implementations."
Say Product is in the 1M - 10M domain bucket in the May 2026 file. Do not invent exact counts.
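One way to keep reporting honest is to translate a bucket label into a range statement and never into a single number. The sketch below assumes the six labels listed above; whether each boundary is inclusive is not stated in this article, so the wording stays loose on purpose.

```python
# Bounds implied by the bucket labels. None means the side is open.
# Boundary inclusivity is an assumption left unspecified here.
BUCKET_BOUNDS = {
    "10M+": (10_000_000, None),
    "1M - 10M": (1_000_000, 10_000_000),
    "100K - 1M": (100_000, 1_000_000),
    "10K - 100K": (10_000, 100_000),
    "1K - 10K": (1_000, 10_000),
    "< 1K": (None, 1_000),
}

def describe(term, bucket):
    """Return a range statement for a term, never an exact count."""
    low, high = BUCKET_BOUNDS[bucket]
    if high is None:
        return f"{term}: more than {low:,} domains"
    if low is None:
        return f"{term}: fewer than {high:,} domains"
    return f"{term}: between {low:,} and {high:,} domains"

print(describe("Product", "1M - 10M"))
# Product: between 1,000,000 and 10,000,000 domains
```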
The Dataset Does Not Separate JSON-LD, Microdata, And RDFa
Many SEOs will want to use this dataset to argue about syntax.
Do not.
Schema.org's usage documentation says JSON-LD, Microdata, and RDFa are merged into one statistic. That means the dataset cannot prove which markup syntax is used more.
For syntax decisions, use Google's structured data documentation, your CMS constraints, developer maintenance cost, validation results, and the reliability of your source data.
What The High-Adoption Terms Tell Us
The top bucket is mostly structural.
| Type | May 2026 domain bucket | Practical interpretation |
|---|---|---|
| BreadcrumbList | 10M+ | Common site navigation and hierarchy markup. |
| WebSite | 10M+ | Site-level identity and search actions. |
| WebPage | 10M+ | Generic page description layer. |
| Organization | 10M+ | Brand, company, publisher, and entity markup. |
| Person | 10M+ | Author, profile, and person references. |
| SearchAction | 10M+ | Site search action markup. |
That lines up with how structured data ships on many sites: SEO plugins, CMS defaults, article templates, breadcrumbs, organization blocks, and page-level schema.
The next tier moves into commerce, reviews, articles, video, and local business data.
| Type | May 2026 domain bucket | Where it usually matters |
|---|---|---|
| Product | 1M - 10M | Product detail pages and ecommerce catalogs. |
| Offer | 1M - 10M | Price, availability, and sales terms. |
| Review | 1M - 10M | Review content and product feedback. |
| AggregateRating | 1M - 10M | Aggregate rating summaries where policy allows. |
| FAQPage | 1M - 10M | FAQ content, with current Google visibility limits in mind. |
| Article | 1M - 10M | Editorial pages and content publishing. |
| BlogPosting | 1M - 10M | Blog templates and post archives. |
| VideoObject | 1M - 10M | Pages with real video assets. |
| LocalBusiness | 1M - 10M | Local entity pages where the business is represented. |
| Service | 1M - 10M | Service pages with clear service information. |
This helps with prioritization.
If an ecommerce site has product pages with missing or invalid Product, Offer, price, availability, sku, and brand data, that is a practical schema issue. These concepts sit in high adoption ranges because many sites and platforms need them.
That does not mean every implementation is correct. It means the concepts are common enough that your site should have a clear reason if they are absent from pages that support them.
This connects with the machine-readable commerce work I covered in "Google Universal Cart: AI Agents Change Ecommerce SEO." The usage dataset does not measure AI agent readiness, but it can help ecommerce teams separate baseline product data from custom schema maturity work.
High Adoption Is Not The Same As SEO Value
A high domain bucket means many domains use a term. It does not mean Google rewards the term with higher rankings.
For example, FAQPage appears in the 1M - 10M domain bucket in the May 2026 snapshot. That does not undo Google's reduced FAQ rich result visibility for most sites.
If your team is reviewing old FAQ markup, pair this dataset with the practical guidance in "Google Removed FAQ Rich Results. What Should SEO Teams Do Now?" Adoption can tell you that a term is widely used. It cannot tell you that a rich result still appears for your page.
The same rule applies across structured data. Common terms deserve attention because they may represent baseline page meaning, CMS defaults, or commerce requirements. They still need page-level validation and a real business case.
The Long Tail Is Where Teams Need Judgment
The May 2026 file puts 4,264 terms in the < 1K bucket. That is 76.9% of all terms.
It would be a mistake to treat that bucket as a discard pile.
Some Schema.org terms are narrow because they belong to narrow content types. Medical, education, government, scientific, event, transport, and dataset vocabulary will never appear on as many domains as name, url, image, or WebPage.
Some terms are newer. Some need source data that many CMS setups do not store. Some lack a visible rich result, so plugin authors rarely make them defaults.
A rare term can still be the right term.
The decision should be based on fit:
| Question | Good reason to use a low-bucket term | Weak reason to use it |
|---|---|---|
| Does the page truly match the term? | A dataset page using Dataset. | A generic blog post adding Dataset because the term exists. |
| Is the information visible or supported? | A course page with real course metadata. | A template creating fields users never see. |
| Is source data reliable? | A product system with maintained return policy fields. | Manual fields that decay after launch. |
| Does the business case matter? | A regulated publisher using precise domain vocabulary. | A sitewide schema block copied from a competitor. |
Low adoption should trigger better questions, not automatic rejection.
How To Use This In A Schema Audit
Add a "Public usage bucket" column to your structured data audit.
Then use the bucket as context, not the verdict.
| Term | Current site status | May 2026 bucket | Audit action |
|---|---|---|---|
| BreadcrumbList | Missing from article templates | 10M+ | Add if breadcrumbs are visible and template data is stable. |
| Product | Present but missing price and availability | 1M - 10M | Fix source data and validation before adding new schema ideas. |
| MerchantReturnPolicy | Not implemented | 100K - 1M | Consider if return policy fields are reliable and relevant. |
| Dataset | Proposed for a research library | 10K - 100K | Use where the page is a real dataset page. |
| MedicalCondition | Proposed for generic wellness posts | 10K - 100K | Use only when the content and policy context support it. |
This changes the meeting.
Instead of saying "we should add schema because competitors have it," you can say:
This term is common across domains, our page type supports it, and our CMS has the right fields.
Or:
This term is valid for a niche use case, but our page does not support it and our source data is weak.
That is how the dataset earns its place in SEO work.
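Adding the bucket column can be as simple as a lookup against the public file. This is a rough sketch: the bucket values come from the tables in this article, the audit rows are illustrative, and `ExampleInternalTerm` is a made-up name included only to show what happens when a term is not in the snapshot.

```python
# Buckets quoted in this article for the May 2026 snapshot.
MAY_2026_BUCKETS = {
    "BreadcrumbList": "10M+",
    "Product": "1M - 10M",
    "MerchantReturnPolicy": "100K - 1M",
}

# Illustrative audit rows; the last term is hypothetical, not a real Schema.org term.
audit = [
    {"term": "BreadcrumbList", "status": "Missing from article templates"},
    {"term": "Product", "status": "Present but missing price and availability"},
    {"term": "ExampleInternalTerm", "status": "Proposed"},
]

def with_bucket(rows, buckets):
    """Return copies of the audit rows with a public usage bucket attached."""
    return [{**row, "bucket": buckets.get(row["term"], "not found")} for row in rows]

for row in with_bucket(audit, MAY_2026_BUCKETS):
    print(f'{row["term"]}: {row["bucket"]} | {row["status"]}')
```

The bucket lands next to the site status, so the audit action can weigh both instead of treating either one as the verdict.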
A Practical Decision Workflow
Use this sequence before adding or removing structured data.
| Step | Question | Output |
|---|---|---|
| 1. Bucket check | Where does the term sit in the May 2026 public stats file? | Adoption context. |
| 2. Eligibility check | Does Google document a relevant search feature for this page type? | Search feature context. |
| 3. Page-content check | Is the information visible or clearly represented on the page? | Accuracy check. |
| 4. Source-data check | Can the CMS, feed, PIM, CRM, or editorial system maintain it? | Maintenance check. |
| 5. Validation check | Does the markup pass Schema.org and Google validation tools? | Technical QA. |
| 6. Business-value check | Does it help search engines, AI systems, commerce systems, QA, or internal tooling understand the page? | Priority call. |
| 7. Monitoring check | Can you detect regressions after template, plugin, or feed changes? | Ongoing control. |
The public bucket is step one. It is not the final answer.
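For teams that track these reviews in a script or spreadsheet export, the sequence above can be sketched as an ordered checklist. The step keys are hypothetical labels paraphrasing the table, and the review data is invented for illustration.

```python
# Step keys paraphrase the seven workflow steps; the names are hypothetical.
WORKFLOW = [
    "bucket",
    "eligibility",
    "page_content",
    "source_data",
    "validation",
    "business_value",
    "monitoring",
]

def open_steps(answers):
    """Return workflow steps, in order, that are unanswered or answered False."""
    return [step for step in WORKFLOW if not answers.get(step, False)]

# Hypothetical review of one proposed term on one template.
review = {
    "bucket": True,         # bucket looked up and recorded
    "eligibility": True,
    "page_content": True,
    "source_data": False,   # CMS cannot maintain the field yet
    "validation": True,
}

print(open_steps(review))  # ['source_data', 'business_value', 'monitoring']
```

A term moves forward only when the list comes back empty, which keeps the bucket from standing in for the later checks.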
For teams working on AI search visibility, this same discipline matters. In "Google's AI Search Documentation Is Finally Here," the practical point was that AI search still depends on crawlable, understandable, well-structured content. Schema.org usage data can support that work, but it does not replace content quality, source reliability, or search policy.
What CMS And Plugin Teams Should Do
If you build SEO tooling, WordPress plugins, Shopify apps, headless CMS modules, or ecommerce templates, this dataset is a useful benchmark.
It can help you decide:
- Which common terms deserve first-class defaults.
- Which fields need admin UI support.
- Which schema modules should stay optional.
- Which niche terms need warnings, docs, or source-data requirements.
- Which existing defaults may be too generic for the page type.
For example, article templates should handle Article or BlogPosting, author, publisher, headline, image, date published, and date modified cleanly. Product templates should not emit half-populated Product markup without reliable offer data. Local templates should not place LocalBusiness markup on every page when the page does not represent a real local entity.
The dataset will not design your defaults for you. It gives you a public adoption reference to test those defaults against.
What Ecommerce Teams Should Do
Ecommerce teams should read the dataset beside product feed quality, Merchant Center setup, return policies, availability data, review policy, and checkout readiness.
Several commerce terms sit in high adoption ranges:
| Commerce term | May 2026 bucket |
|---|---|
| Product | 1M - 10M |
| Offer | 1M - 10M |
| price | 1M - 10M |
| priceCurrency | 1M - 10M |
| availability | 1M - 10M |
| sku | 1M - 10M |
| brand | 1M - 10M |
| aggregateRating | 1M - 10M |
| review | 1M - 10M |
| MerchantReturnPolicy | 100K - 1M |
This does not prove that stores with those terms perform better.
It does show that product, offer, pricing, availability, brand, SKU, review, and policy data are common machine-readable commerce concepts. If your store lacks them, the next step is not chasing exotic schema. The next step is fixing the product data layer.
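A quick way to spot gaps in the product data layer is to check a page's Product JSON-LD for the baseline fields named above. The sketch below is a rough audit helper, not a validator: the field lists are this article's baseline concepts rather than Google's documented requirements, and the sample markup is hypothetical.

```python
import json

# Baseline fields taken from the commerce terms discussed above.
# These lists are an assumption for illustration, not Google's requirements.
PRODUCT_FIELDS = ("name", "sku", "brand")
OFFER_FIELDS = ("price", "priceCurrency", "availability")

# Hypothetical JSON-LD block from a product page.
SAMPLE = """{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example mug",
  "sku": "MUG-001",
  "offers": {"@type": "Offer", "price": "12.00", "priceCurrency": "EUR"}
}"""

def missing_fields(jsonld_text):
    """List baseline Product and Offer fields that are absent or empty."""
    data = json.loads(jsonld_text)
    missing = [f for f in PRODUCT_FIELDS if not data.get(f)]
    offer = data.get("offers") or {}
    if isinstance(offer, list):  # offers may be a list; check the first entry only
        offer = offer[0] if offer else {}
    missing += [f"offers.{f}" for f in OFFER_FIELDS if not offer.get(f)]
    return missing

print(missing_fields(SAMPLE))  # ['brand', 'offers.availability']
```

A result like this points back to the source system: the fix is a populated brand and availability field, not a new schema type.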
What Not To Do With The Dataset
- Do not turn adoption into ranking claims.
- Do not compare the buckets to page counts.
- Do not claim the file proves JSON-LD usage share.
- Do not remove accurate niche schema only because it sits in a low bucket.
- Do not add a high-bucket term when the page does not support it.
- Do not use Natzir's explorer as the final numeric source when the official CSV and JSON are available.
- Do not treat the May 2026 snapshot as permanent. Schema.org says the files are planned for monthly updates, so rerun your comparisons when new snapshots arrive.
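Rerunning the comparison when a new snapshot arrives can be sketched as a simple diff of term-to-bucket mappings. The May values below are the ones quoted in this article; the "later" snapshot and `ExampleNewTerm` are entirely invented to show the mechanics, not a prediction of future data.

```python
def bucket_changes(old, new):
    """Return {term: (old_bucket, new_bucket)} for terms whose bucket differs."""
    changes = {}
    for term in sorted(set(old) | set(new)):
        before, after = old.get(term), new.get(term)
        if before != after:
            changes[term] = (before, after)
    return changes

# Buckets quoted in this article for May 2026.
may_2026 = {"Product": "1M - 10M", "Dataset": "10K - 100K"}

# Hypothetical later snapshot, invented purely for illustration.
later = {"Product": "1M - 10M", "Dataset": "100K - 1M", "ExampleNewTerm": "< 1K"}

print(bucket_changes(may_2026, later))
# {'Dataset': ('10K - 100K', '100K - 1M'), 'ExampleNewTerm': (None, '< 1K')}
```

A `None` on either side means the term was missing from that snapshot, which is worth reviewing separately from a bucket move.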
Practical Takeaway
The Schema.org usage dataset gives SEOs a public adoption benchmark for structured data vocabulary.
Use it to make audits sharper. Use it to help stakeholders understand why baseline schema matters. Use it to separate common template issues from niche enhancements. Use it to improve plugin and CMS defaults.
Keep the boundary clear:
- Domains, not pages.
- Buckets, not exact counts.
- Google's public crawl, not the full web.
- Merged formats, not syntax share.
- Adoption data, not ranking data.
The best use of the dataset is simple: make structured data decisions less anecdotal.
Start with the bucket. Then validate the page, the policy, the source data, the markup, and the business value.
FAQ
Is the Schema.org usage dataset a Google ranking factor list?
No. It is an adoption dataset. A high domain bucket means a term appears across many unique domains in Google's public crawl. It does not mean the term has ranking weight.
Does the dataset count pages or URLs?
No. It counts unique domains. A term used on many pages of the same domain still contributes one domain to that term's bucket.
Does it separate JSON-LD, Microdata, and RDFa?
No. Schema.org's usage documentation says those formats are combined into one statistic, so the dataset cannot answer syntax-share questions.
Should I remove schema types that sit in the < 1K bucket?
No. Low adoption can mean a term is specialized, new, difficult to populate, or relevant to a narrow content model. Remove it only if it is inaccurate, unsupported, invalid, or has no business case.
Should I add every term in a high bucket?
No. Add schema only when it accurately describes the page and can be maintained from reliable data. A high bucket gives context, not permission to mark up unrelated content.
What is the best way to use this in an SEO audit?
Add the public usage bucket as a column, then review each term against page eligibility, visible content, source data, validation, business value, and monitoring.
Is Natzir's explorer official?
No. It is a helpful third-party interface for exploring the official public stats dataset. Use the official Schema.org GitHub CSV or JSON for final numeric claims.
Does this change Google's structured data guidelines?
No. It gives the community adoption data. You still need to follow Google's structured data documentation, quality rules, and rich result requirements for the specific page type.
