## `content_extractions`
To account for evolving extraction methods, an `ExtractionResult` represents a [[crawl_results]] entry as extracted at a specific point in time with a specific extraction method. The final `ContentExtraction` includes the following fields:
```python
# Required fields
self.extraction_method = extraction_method
self.timestamp = timestamp
self.text = text
self.url = url
self.fingerprint = fingerprint
# Optional ID fields
self.content_extraction_id = content_extraction_id
self.crawl_result_id = crawl_result_id
self.client_id = client_id
# Optional content fields
self.title = title
self.author = author
self.publication_date = publication_date
self.modified_date = modified_date
self.description = description
self.site_name = site_name
self.categories = categories
self.tags = tags
self.language = language
self.word_count = word_count
self.hostname = hostname
self.image = image
self.page_license = page_license
self.page_type = page_type
self.comments = comments
```
The unique constraint is `(raw_page_id, extraction_method)`: a raw page can have multiple extraction attempts, but at most one per extraction method.
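As a rough illustration of how that constraint behaves, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout is a trimmed-down assumption (only a few columns, with guessed types), the helper `insert_extraction` and the method names `method_a` / `method_b` are hypothetical, and SQLite stands in for whatever database the project actually uses.

```python
import sqlite3

# Hypothetical, trimmed-down schema: only the columns needed to show the constraint.
SCHEMA = """
CREATE TABLE content_extractions (
    content_extraction_id INTEGER PRIMARY KEY,
    raw_page_id INTEGER NOT NULL,
    extraction_method TEXT NOT NULL,
    text TEXT NOT NULL,
    UNIQUE (raw_page_id, extraction_method)
)
"""


def insert_extraction(conn, raw_page_id, extraction_method, text):
    """Insert one row; return False if this (page, method) pair already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO content_extractions "
                "(raw_page_id, extraction_method, text) VALUES (?, ?, ?)",
                (raw_page_id, extraction_method, text),
            )
        return True
    except sqlite3.IntegrityError:
        # Raised when the UNIQUE constraint is violated.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

first = insert_extraction(conn, 1, "method_a", "hello")            # new pair
second = insert_extraction(conn, 1, "method_b", "hello")           # same page, other method
duplicate = insert_extraction(conn, 1, "method_a", "hello again")  # repeated pair
```

With this layout, the same page can be stored once per extraction method, while a repeated `(raw_page_id, extraction_method)` pair is rejected. Whether a repeat should be rejected, ignored, or turned into an update is a design choice this sketch does not settle.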
Extraction should exclude PPI.
The problem with HTML hashes is that they change for purely cosmetic reasons. That is why we add the `content_hash`, which tells us whether the content we value actually changed.
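The difference between the two hashes can be sketched as follows. This is only an illustration under stated assumptions: SHA-256, whitespace normalization, and the helper names `html_hash`, `content_hash`, and `extract_text` are all hypothetical choices, and the toy text extractor built on the standard library's `HTMLParser` stands in for the real extraction method.

```python
import hashlib
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Toy extractor: collects text nodes and ignores tags and attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html: str) -> str:
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.parts)


def html_hash(html: str) -> str:
    # Hash of the raw markup: sensitive to any change, including cosmetic ones.
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def content_hash(text: str) -> str:
    # Assumption: collapse runs of whitespace before hashing the extracted text.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


page_v1 = '<div class="post"><p>Hello world</p></div>'
page_v2 = '<div class="post dark-theme"><p>Hello   world</p></div>'  # cosmetic change
page_v3 = '<div class="post"><p>Hello there</p></div>'               # content change

cosmetic_html_differs = html_hash(page_v1) != html_hash(page_v2)
cosmetic_content_same = content_hash(extract_text(page_v1)) == content_hash(extract_text(page_v2))
real_change_detected = content_hash(extract_text(page_v1)) != content_hash(extract_text(page_v3))
```

In this sketch the markup hash differs between `page_v1` and `page_v2` even though the extracted text is the same, while the content hash only changes when the text itself changes, as in `page_v3`.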
Consider adding `extraction_duration` or other extraction metrics. This is not written yet, so it will likely come later.