# Cover Letter

https://www.documentcrunch.com/
https://apply.workable.com/document-crunch/j/2C06F22DB0/
Draft: https://docs.google.com/document/d/1ZNX7O_5zBrJIs8GVGM4eARB7IL6_Sgrmw-cHyP4rU5w/edit?usp=sharing
alt: https://docs.google.com/document/d/1pxxTk7PaezxlmMAAinQBkTbCp7IIiyDTFxSWPAjgoGY/edit?usp=sharing

%%
## Name-drop

- [[Design, Develop, & Deploy a Python Application]]
- [[Domain-Driven Design (DDD)]]
- [[Information Retrieval (IR)]]
- [[Retrieval Augmented Generation (RAG)]]
- [[Natural Language Processing (NLP)]]

![[Information Retrieval System.canvas|Information Retrieval System]]

---

## Draft

I know how to implement RAG best practices, and I keep up with the latest research papers. I will not pose as a gifted programmer, but I have fallen in love with machine learning and created an application to bring cutting-edge technology to Public Relations. You may find a better programmer, but you will not find someone more obsessed with applying RAG to business.

I have 5 years of experience as a Public Relations professional; I'm a people person first and foremost. Within that same time frame, I've dedicated my free time to learning the foundations of computer science and programming with Python, and two years ago I fell in love with machine learning. I don't have a traditional background in computer science. Instead, I taught myself how computers work from the logic gates up (link to your GH with nand2tetris work!), and I read the latest papers on RAG to find solutions to problems.

I've built an application that follows RAG best practices but distinctly caters to my industry. It features a RAG operation involving IR from two distinct data sources: Pinecone (which stores my client's blogs) and GDELT (a web news API). The combined results are injected into an LLM to produce the final representation: a highly personalized, conversational response with fewer hallucinations and up-to-date information.
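A minimal sketch of that two-source flow, with stand-in functions in place of the real Pinecone and GDELT calls (the function names, sample strings, and prompt wording here are illustrative, not the production code):

```python
# Sketch of the two-source RAG operation: retrieved context from a vector
# store (client blogs) and a news API is merged into one LLM prompt.
# The retrieval functions below are stubs standing in for Pinecone/GDELT.

def retrieve_client_chunks(query: str) -> list[str]:
    # Stand-in for a Pinecone similarity search over embedded blog chunks.
    return ["Client blog excerpt about crisis communications."]

def retrieve_web_news(query: str) -> list[str]:
    # Stand-in for a GDELT news-API lookup.
    return ["Breaking story: industry regulator announces new rules."]

def build_rag_prompt(query: str) -> str:
    # Combine both sources, number them, and ground the LLM in the context.
    context = retrieve_client_chunks(query) + retrieve_web_news(query)
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {query}"
    )

prompt = build_rag_prompt("How should my client respond to the new rules?")
print(prompt)
```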
The first major issue I encountered while developing my search system was linking my client's documents to the latest web news. How do you distill hundreds of blogs into human-readable topical representations, i.e., titles? The solution involved topic modeling the client corpus to identify underlying themes. For example, 700 blogs published within the past 10 years became 2,500 chunks of text that were categorized into 70 clusters using advanced NLP techniques. Finally, I used an LLM to generate an abstractive title from the most representative texts per cluster. Effectively, `newsHack` monitors web news for each cluster within my client's blog history.

**I tackle problems in my search system as they arise by keeping up with the latest AI, IR, and ML research papers. For example, the first major issue I encountered in development was programmatically and intelligently categorizing hundreds of blogs and creating human-readable titles for the groups. The solution involved using advanced NLP techniques to topic model the blogs and identify underlying themes. For example, 700 blogs published within the past 10 years became 2,500 chunks of text that were categorized into 70 clusters. Finally, I used an LLM to generate an abstractive title from the most representative texts per cluster. Effectively, `newsHack` uses these titles to monitor web news for each cluster within my client's blog history.**

Another major problem I encountered when developing the search system was poor retrieval of client documents. I learned the lackluster recall was due to asymmetric semantic search, where my query did not resemble the documents being retrieved in formatting, length, or style, even after normalizing and chunking. I found Hypothetical Document Embeddings (HyDE) significantly improved recall and subsequent LLM generation.
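The HyDE fix can be sketched as follows; the LLM call and embedding model are stubbed with toy stand-ins, so only the shape of the technique (query becomes a hypothetical document, which is then matched document-to-document) reflects the real system:

```python
# HyDE sketch: instead of embedding the short query directly, an LLM first
# writes a hypothetical answer document, and THAT text is embedded and
# matched against the corpus. Both the LLM and the embedder are stubs.
import math

def fake_llm_hypothetical_doc(query: str) -> str:
    # Stand-in for an LLM prompt like:
    # "Write a short blog passage that answers: {query}"
    return f"A blog-style passage discussing {query} in depth, with examples."

def fake_embed(text: str) -> list[float]:
    # Stand-in for a real embedding model: a crude character histogram.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

corpus = [
    "Long-form blog chunk about media training for executives.",
    "Long-form blog chunk about SEO keyword strategy.",
]
query = "media training"
hyde_text = fake_llm_hypothetical_doc(query)  # query -> pseudo-document
scores = [cosine(fake_embed(hyde_text), fake_embed(c)) for c in corpus]
best = corpus[scores.index(max(scores))]
print(best)
```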
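The clustering-and-titling step described above can be sketched like this; the production system reportedly uses embedding-based topic modeling (e.g., BERTopic), so the TF-IDF + KMeans pairing, toy corpus, and `llm_title` stub here are illustrative stand-ins only:

```python
# Sketch: chunk -> vectorize -> cluster -> abstractive title per cluster.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

chunks = [
    "How to pitch journalists during a product launch",
    "Media pitching tips for startup founders",
    "Writing a crisis communications plan",
    "Crisis response checklists for PR teams",
]

# Vectorize the chunks and group them into clusters of related themes.
vectors = TfidfVectorizer().fit_transform(chunks)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

def llm_title(texts: list[str]) -> str:
    # Stand-in for an LLM call that writes an abstractive, human-readable
    # title from a cluster's most representative chunks.
    return "Cluster title for: " + texts[0][:30]

clusters: dict[int, list[str]] = {}
for label, chunk in zip(labels, chunks):
    clusters.setdefault(label, []).append(chunk)

titles = {label: llm_title(texts) for label, texts in clusters.items()}
print(titles)
```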
**I love engineering, and one of my favorite ways to share what I learn is to develop systems of thinking and diagrams for non-technical people.**

---

If you're looking for an insanely gifted or highly credentialed Computer Scientist, don't read this cover letter.

%%

For the past 18 months I've been bridging the technology gap in the Public Relations and Communications industry by integrating bleeding-edge AI and ML technologies into my agency's daily workflow to serve dozens of clients. I can help Document Crunch implement relevant technologies into its existing tech stack to create more value for its users, in line with your CEO's and new CTO's pledge to make construction contracts more accessible to every stakeholder.

The job listing puts [[Retrieval Augmented Generation (RAG)]] expertise as the foremost requirement for the position, but RAG is predicated on a much more established, lesser-known field called [[Information Retrieval (IR)]]. IR systems assist in searching for, locating, and presenting information that matches a user's search query or otherwise satisfies their information need. I see RAG as the presentation layer of the broader IR system.

%%

I say this not to denigrate RAG; I know the value of such systems because I've built them from scratch for my agency to better serve its clients. Where users were once tasked with analyzing search results post-retrieval, advancements in [[Large Language Model (LLM)]] reasoning allow IR systems, through RAG techniques, to return a concise response based on all relevant documents. This is insanely valuable for almost everyone. Rather, I want to emphasize that RAG isn't even the beginning of a comprehensive search system; it's at the end.
%%

%%
A complete search system includes the following tasks: [[Text Normalization]], [[Text Chunking]], [[Metadata Enrichment]], [[Composite Embedding]], storage in a [[Vector Database]], [[Information Retrieval (IR)]], [[Hybrid Search]] + [[Hypothetical Document Embeddings (HyDE)]] with a [[Re-ranker]], [[Reverse Repacking]] and [[RECOMP]] for [[Retrieval Augmented Generation (RAG)]], and presentation ([[Retrieval Augmented Generation (RAG)]]).
%%

%%
To expand on IR, and at risk of boring you to death or telling you what you already know (depending on who's reading this), there are two mainstream paradigms for IR:

1. lexical-based sparse retrieval
2. embedding-based dense retrieval

The first paradigm is the typical keyword search we're all familiar with, like [[Okapi BM25]] or its variants. The second is the more nuanced semantic search. Another emerging paradigm within IR is RAG. Semantic search is distinct from and predates RAG by 8 years or more.
%%

I've built an application that follows industry IR and RAG best practices but is distinctly catered to the Public Relations and Communications industry. Technically, it uses a RAG operation involving IR from two distinct data sources: Pinecone (client docs) and GDELT (a web news API). The combined result is injected into an LLM, specifically OpenAI's gpt-4o, to produce the final representation.

I call it `newsHack`, and it monitors current web news from around the globe. When there's web news relevant to my client, based on my client's blog content history, `newsHack` pushes a realistic client quote to my company's Slack outlining the web news and my client's expert take. My agency uses the result to create future blog ideas for specific clients, bring dusty blogs back to life by updating their examples, improve pitch hooks with relevant and hot news, and add data-driven insights to our PR campaign proposals.
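The lexical paradigm can be sketched as a from-scratch [[Okapi BM25]] scorer over a toy corpus (the dense paradigm is simply cosine similarity over learned embedding vectors instead); `k1` and `b` are the standard BM25 free parameters, and the corpus strings are illustrative:

```python
# From-scratch BM25 over a tiny tokenized corpus: score each document
# against a query by term frequency, inverse document frequency, and a
# document-length normalization term.
import math
from collections import Counter

docs = [
    "construction contract risk clauses",
    "payment terms in construction contracts",
    "marketing blog about brand strategy",
]
tokenized = [d.split() for d in docs]
avgdl = sum(len(t) for t in tokenized) / len(tokenized)  # avg doc length
N = len(docs)

def bm25(query: str, k1: float = 1.5, b: float = 0.75) -> list[float]:
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.split():
            df = sum(term in t for t in tokenized)      # document frequency
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
            f = tf[term]                                 # term frequency
            score += idf * f * (k1 + 1) / (
                f + k1 * (1 - b + b * len(toks) / avgdl)
            )
        scores.append(score)
    return scores

scores = bm25("construction contract")
best = docs[scores.index(max(scores))]
print(best)  # the doc matching both query terms exactly ranks first
```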
This is all enabled by RAG, which allows my IR system to produce conversational results with fewer hallucinations, up-to-date information, and extreme personalization.

The first major issue I encountered during development was making the link between my client's documents and current web news. The solution was topic modeling the corpus with BERTopic to identify underlying themes. Finally, I use an LLM to generate an abstractive summary of the most representative documents, as well as a GDELT query. Effectively, `newsHack` monitors web news for each cluster within my client's corpus.

%%
First, my client's blog content is scraped from their website. The documents are processed (normalized and chunked), embedded, and stored in PostgreSQL (as the full blog) and Pinecone (as a chunk with embedding). To make the link between doc chunks and web news, I employ topic modeling techniques, specifically Python's BERTopic library, to generate clusters, which helps me identify underlying themes in my client's blog history. Each cluster has a human-readable representation and a GDELT query generated by an LLM. Effectively, `newsHack` monitors web news for each cluster within my client's corpus.
%%

%%
When `newsHack` identifies relevant news, down to the sentence level, it summarizes hundreds of similar quotes from the past 72 hours. To introduce my client's expertise on the subject, a two-stage retrieval process is used: the initial query ranks the top 100 chunks by cosine similarity of dense vectors. The results are then reranked with a dedicated reranking model (although a simple BM25 calculation might do), with only the 10 most relevant chunks passing.
%%

The next significant problem I encountered was with the recall of relevant client documents.
I learned the poor recall was due to asymmetric semantic search, where my query did not resemble the documents being retrieved in formatting, length, or style, even after normalizing and chunking. I found [[Hypothetical Document Embeddings (HyDE)]] significantly improved recall and subsequent LLM generation.

%%
LLMs are prompted to generate this hypothetical document, so rather than matching queries to documents, I began matching documents to documents.
%%

%%
Finally, in RAG applications, the retrieved and reranked results are reverse-packed (because research shows LLMs pay more attention to the end of the prompt) and compressed into summaries. The result is injected into the LLM to produce the final representation.
%%

There are many more advanced techniques besides these, and an absolute solution does not exist, which is why constant IR and RAG evaluation and tweaking are necessary. I can help Document Crunch build and maintain a unique and robust information retrieval system, including semantic search and RAG, that solves real-world business objectives.

Thank you for your consideration,

Ethan Young

PS. No LLM was used in the writing of this letter.

o1 version:

```
**Ethan Young**
_(512) 810-2644 | [email protected]_

Over the past 18 months, I’ve bridged the technology gap in the Public Relations industry by integrating cutting-edge AI-powered search into my agency's daily workflow to deliver exceptional results for clients. I’m excited to bring this expertise to Document Crunch, helping your team create scalable, impactful solutions for construction contract management.

I built a custom search system from the ground up, adhering to best practices in Information Retrieval (IR) and Retrieval-Augmented Generation (RAG). In PR, this system has transformed how we align our clients’ expertise with trending news, enabling them to navigate rapidly changing news cycles with precision. By identifying overlaps between journalists’ coverage and clients’ domains, the system notifies my agency via Slack of emerging narratives, complete with summaries and relevant client content. These insights are used to craft journalist pitches, inform blog ideas, and strengthen campaign proposals, ultimately increasing client visibility and credibility.

This system was developed in just nine months while I managed full-time PR responsibilities across multiple accounts. I achieved this without a formal computer science degree, opting for a self-directed learning path after high school to maximize ROI on education and avoid student debt. My approach, leveraging just-in-time learning and practical application, has allowed me to master complex topics like AI, IR, and ML. For instance, I’ve built foundational knowledge of computer systems from logic gates up and stay current with the latest research to overcome challenges as they arise.

I thrive in fast-paced, problem-solving environments and bring the grit, adaptability, and determination necessary to contribute to a small, innovative team like Document Crunch. I’m confident that my expertise in building and deploying AI-driven IR systems, combined with my passion for creating user-focused solutions, aligns with your mission to enhance document compliance in the construction industry.

If you’d like to discuss the technical details of my work or how I can help Document Crunch build and optimize IR systems, including semantic search and RAG, feel free to reach out at (512) 810-2644 or [email protected].
```