Feature: repository search
We need to show the following materials in our blended search results:
- Regulations
- Public policy documents that we link to:
  - Federal Register rules
  - Supplemental content: subregulatory guidance, technical assistance, etc.
- Uploaded policy document files (internal to CMCS)
Each kind of material has slightly different fields, so the purpose of this page is to explicitly document how we want search to work.
In our Postgres database we have:
- The full text of regulation sections in scope, imported via eCFR API
- Metadata about each document:
  - Imported via Federal Register API for post-1994 rules (and hand-corrected as needed)
  - Entered by hand for everything else
- The full text of most documents, extracted via our Text Extractor Lambda function, which uses:
  - Google Magika to detect file types
  - AWS Textract to process PDFs, including text detection for scanned documents
  - A variety of open source libraries to process Outlook, Word, Excel, PowerPoint, RTF, TXT, HTML, image, and ZIP files
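As a rough illustration of the routing described in the list above, the following sketch picks an extraction path from a detected file type. The type labels, the returned extractor names, and the `choose_extractor` helper are all hypothetical; they are not the Text Extractor's actual code or Magika's actual output labels.

```python
# Hypothetical type labels, for illustration only.
TEXTRACT_TYPES = {"pdf"}
LIBRARY_TYPES = {
    "outlook", "word", "excel", "powerpoint",
    "rtf", "txt", "html", "image", "zip",
}


def choose_extractor(detected_type):
    """Return a label for the extraction path to use, or None if unsupported."""
    label = detected_type.lower()
    if label in TEXTRACT_TYPES:
        # PDFs (including scanned ones) go to AWS Textract.
        return "textract"
    if label in LIBRARY_TYPES:
        # Everything else in scope is handled by an open source library.
        return f"library:{label}"
    return None
```

In this sketch an unsupported type returns `None`, so the caller decides whether to skip the file or record an error.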
We have a special process for extracting the full text of post-1994 Federal Register rules, because the Federal Register website does not allow scraping of its web pages. We fetch each rule's text-only URL via the Federal Register API and extract the text from that URL instead of from the normal web page.
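A minimal sketch of that lookup, assuming the Federal Register API's single-document endpoint and its `raw_text_url` field are the ones used here. The helper names and the document number are hypothetical, and no network request is made in this example.

```python
import json

# Assumed base URL for the public Federal Register API.
FR_API_BASE = "https://www.federalregister.gov/api/v1"


def document_api_url(document_number):
    """Build the API URL that asks only for the text-only URL field."""
    return f"{FR_API_BASE}/documents/{document_number}.json?fields[]=raw_text_url"


def text_url_from_response(body):
    """Pull the text-only URL out of an API response body (a JSON string)."""
    data = json.loads(body)
    # Returns None when the field is absent, so the caller can fall back.
    return data.get("raw_text_url")
```

The caller would fetch `document_api_url(...)`, pass the response body to `text_url_from_response`, and send the resulting URL to the text extractor.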
We use Postgres full-text search through Django's built-in support for it (the `django.contrib.postgres.search` module).
For team members, see "Structure and content of resources" (requires login).
Our metadata fields for FR docs, supplemental content, and uploaded files are shown here as they look on a subject page:
Factors for what you can see:
- Supplemental content and FR docs can be marked "approved" or not approved in the admin panel. Items that aren't approved are only visible in the admin panel (which is only available to logged-in users), never shown in search results or elsewhere on the site.
- If you're not logged in, you cannot see internal documents (uploaded files) in search results or elsewhere on the site.
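The two visibility rules above can be sketched as a single check. The `Document` class and the `kind` values are hypothetical names for illustration, and the sketch assumes the "approved" flag applies only to FR docs and supplemental content, as described above.

```python
from dataclasses import dataclass


@dataclass
class Document:
    kind: str              # "fr_doc", "supplemental_content", or "uploaded_file"
    approved: bool = True  # only meaningful for FR docs and supplemental content


def visible_on_site(doc, logged_in):
    """Decide whether a document may appear in search results or elsewhere on the site."""
    if doc.kind == "uploaded_file":
        # Internal documents are only shown to logged-in users.
        return logged_in
    # Unapproved FR docs and supplemental content stay in the admin panel only.
    return doc.approved
```

This check covers the public site only; the admin panel, which shows unapproved items, is outside the sketch.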
In search results, we always show the following document metadata if available:
- Document category
- Date
- Subjects
- Related citations
If the desired keyword(s) exist only in the document metadata (FR doc name or description, supplemental content name or description, uploaded file name or summary, etc.), show that document metadata. This means:
- FR doc: name (grey metadata) and description (blue link)
- Supplemental content: name (grey metadata) and description (blue link)
- Uploaded files: name (blue link) and summary (black text)
If the desired keyword(s) also exist in the extracted document text, show the name and description (grey metadata and blue link) AND:
- For all types of documents, show the relevant headline (excerpt) from the full-text content, in black text. (For uploaded files, this headline replaces the summary.)
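The display rules above can be summarized in a small function that returns which fields to show and how to style them. This is a sketch of the rules as written, not our template code; the `kind` values, field names, and style labels are hypothetical.

```python
def display_fields(kind, text_match):
    """Return (field, style) pairs for one search result.

    text_match is True when the keyword(s) also appear in the extracted text.
    """
    if kind == "uploaded_file":
        fields = [("name", "blue link"), ("summary", "black text")]
        if text_match:
            # For uploaded files, the headline replaces the summary.
            fields[1] = ("headline", "black text")
        return fields

    # FR docs and supplemental content share the same layout.
    fields = [("name", "grey metadata"), ("description", "blue link")]
    if text_match:
        fields.append(("headline", "black text"))
    return fields
```

For example, `display_fields("fr_doc", True)` yields the name, the description, and the headline, in that order.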
Team members can look at this doc for context about decisions we made for weights (login required).
The rank filter environment variable tells our search system where to "cut off" the results: should it show many results, including less relevant ones at the end, or only the few results that are most relevant? Lower numbers (like 0.01) show more results, and higher numbers (like 0.1) show fewer results.
The rank filter should be 0.05 in all environments, for both basic (not quoted) and phrase (quoted) search queries. The rank filter value for each environment is in our parameter store: `BASIC_SEARCH_FILTER` and `QUOTED_SEARCH_FILTER`.
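To make the cut-off concrete, here is a minimal sketch of applying a rank filter to ranked results. It assumes the two values are read from environment variables with the names above, that a query counts as a phrase query when it contains a double quote, and that the cut-off is inclusive. The function names and the result shape are hypothetical.

```python
import os

DEFAULT_RANK_FILTER = 0.05  # the value this page says all environments should use


def rank_filter_for(query, env=os.environ):
    """Pick the threshold for a basic or phrase (quoted) query."""
    is_phrase = '"' in query  # assumption: quotes mark a phrase search
    name = "QUOTED_SEARCH_FILTER" if is_phrase else "BASIC_SEARCH_FILTER"
    return float(env.get(name, DEFAULT_RANK_FILTER))


def apply_rank_filter(results, threshold):
    """Keep results at or above the threshold, most relevant first."""
    kept = [r for r in results if r["rank"] >= threshold]
    return sorted(kept, key=lambda r: r["rank"], reverse=True)
```

With ranks of 0.2, 0.07, and 0.03, a threshold of 0.05 keeps the first two results, while 0.01 keeps all three.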
Weights for documents:
- (FR doc) name: A
- (Supplemental content) name: A
- (Uploaded file) name: A
- (FR doc) description: A
- (Supplemental content) description: A
- (Uploaded file) summary: B
- (Uploaded file) filename: C
- Date: C
- Subjects (full names, short names, and abbreviations): D
- Content: D
We could add citations to the weighting list if we want to.
Weights for regulation text sections:
- Section number: A
- Section title: A
- Part title: A
- Content: B
We may be able to add the subpart title to the weighting list if we want to.
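For readers less familiar with Postgres weights, the sketch below turns the two weight lists above into a weighted `tsvector` SQL expression using Postgres's `setweight` and `to_tsvector` functions. The column names, the `'english'` text search configuration, and the helper itself are assumptions for illustration; our real search vectors are built through Django and may differ.

```python
# Hypothetical column names mapped to the weights listed above.
DOCUMENT_WEIGHTS = {
    "name": "A",
    "description": "A",
    "summary": "B",
    "filename": "C",
    "date": "C",
    "subjects": "D",
    "content": "D",
}

REG_SECTION_WEIGHTS = {
    "section_number": "A",
    "section_title": "A",
    "part_title": "A",
    "content": "B",
}


def weighted_tsvector_sql(weights, config="english"):
    """Build a SQL expression that concatenates one weighted tsvector per column.

    Only pass trusted column names; nothing here is escaped.
    """
    parts = [
        f"setweight(to_tsvector('{config}', coalesce({column}::text, '')), '{weight}')"
        for column, weight in weights.items()
    ]
    return " || ".join(parts)
```

Unless custom weights are supplied, Postgres's `ts_rank` scores matches labeled D, C, B, and A as 0.1, 0.2, 0.4, and 1.0 respectively, so a match in an A-weighted field counts ten times as much as a match in a D-weighted field.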
Please note that all pages on this GitHub wiki are draft working documents, not complete or polished.
Our software team puts non-sensitive technical documentation on this wiki to help us maintain a shared understanding of our work, including what we've done and why. As an open source project, this documentation is public in case anything in here is helpful to other teams, including anyone who may be interested in reusing our code for other projects.
For context, see the HHS Open Source Software plan (2016) and CMS Technical Reference Architecture section about Open Source Software, including Business Rule BR-OSS-13: "CMS-Released OSS Code Must Include Documentation Accessible to the Open Source Community".
For CMS staff and contractors: internal documentation on Enterprise Confluence (requires login).