content_extractions - Ethan Young

## `content_extractions`

To account for evolving extraction methods, `ExtractionResult` represents the [[crawl_results]] as extracted at this point in time. The final `ContentExtraction` includes the following fields:

```
# Required fields
self.extraction_method = extraction_method
self.timestamp = timestamp
self.text = text
self.url = url
self.fingerprint = fingerprint

# Optional ID fields
self.content_extraction_id = content_extraction_id
self.crawl_result_id = crawl_result_id
self.client_id = client_id

# Optional content fields
self.title = title
self.author = author
self.publication_date = publication_date
self.modified_date = modified_date
self.description = description
self.site_name = site_name
self.categories = categories
self.tags = tags
self.language = language
self.word_count = word_count
self.hostname = hostname
self.image = image
self.page_license = page_license
self.page_type = page_type
self.comments = comments
```

The unique constraint is `(raw_page_id, extraction_method)`, since each raw page can have multiple extraction attempts.

Extraction should exclude PII (personally identifiable information).

HTML hashes change for purely cosmetic reasons, such as markup or styling tweaks. That is why we add the `content_hash`, which tells us whether the content we value actually changed.

Consider adding `extraction_duration` or other extraction metrics. This is not written yet, so it will likely come later.
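
The two ideas above, a content hash that ignores cosmetic HTML changes and a unique constraint on `(raw_page_id, extraction_method)`, can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the helper names (`content_hash`, `html_hash`, `naive_extract`), the normalization steps, the SQLite table layout, and the method names `method_a` / `method_b` are all assumptions made for the example. The note also does not say whether `content_hash` is stored in `fingerprint` or in its own column, so the sketch uses a separate column.

```python
import hashlib
import re
import sqlite3
import unicodedata


def content_hash(text: str) -> str:
    """Hash extracted text after normalizing Unicode form and whitespace (assumed scheme)."""
    normalized = unicodedata.normalize("NFKC", text)
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def html_hash(html: str) -> str:
    """Hash the raw HTML bytes; any markup change produces a new value."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def naive_extract(html: str) -> str:
    """Stand-in extractor for illustration only: replaces tags with spaces."""
    return re.sub(r"<[^>]+>", " ", html)


# Same article text, different markup (an added class and extra whitespace).
page_v1 = '<div class="post"><p>Hello world.</p></div>'
page_v2 = '<div class="post dark-theme">\n  <p>Hello world.</p>\n</div>'

html_changed = html_hash(page_v1) != html_hash(page_v2)
content_changed = (
    content_hash(naive_extract(page_v1)) != content_hash(naive_extract(page_v2))
)

# Hypothetical table carrying the unique constraint described in the note.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE content_extractions (
        content_extraction_id INTEGER PRIMARY KEY,
        raw_page_id INTEGER NOT NULL,
        extraction_method TEXT NOT NULL,
        content_hash TEXT NOT NULL,
        UNIQUE (raw_page_id, extraction_method)
    )
    """
)

insert_sql = (
    "INSERT INTO content_extractions (raw_page_id, extraction_method, content_hash) "
    "VALUES (?, ?, ?)"
)
digest = content_hash(naive_extract(page_v1))

# Two different methods on the same raw page are both allowed.
conn.execute(insert_sql, (1, "method_a", digest))
conn.execute(insert_sql, (1, "method_b", digest))

# Repeating the same (raw_page_id, extraction_method) pair is rejected.
try:
    conn.execute(insert_sql, (1, "method_a", digest))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Here `html_changed` is true while `content_changed` is false, which is the distinction the `content_hash` is meant to capture. Under this constraint, re-running an existing method on the same raw page would have to update the existing row rather than insert a new one; whether that matches the intended behavior is not stated in the note.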