contactAnalysis MVP - Ethan Young

# contactAnalysis MVP

*Use semantic analysis to search beyond today’s keyword matching systems.*

Search engines traditionally work by looking for overlapping keywords. By leveraging vector embeddings, semantic search can go beyond keyword matching and return results based on the query’s meaning.

https://github.com/ethanpy3/analyzeContacts/commits/master

## Tech Stack

- IDE: PyCharm
- Backend: [[Django]]
- Frontend: [[React.js]] ALT: [[svelte vs solid]]
- Traditional Database: PostgreSQL
- Vector Database: Pinecone
- Backend & Frontend Deployment & Hosting: [[Heroku]]
- Version Control: Git & GitHub
- Environment & Dependency Management: Pipenv
- Other API(s): GPT-4, text-embedding-ada-002

## Key Functionalities:

### 1. CONTACT [[CRUD Operations]]

> (`contacts/views.py/ContactsView` and `ContactsDetailView`)

Import contacts into the PostgreSQL database from JSON (for individual contacts) or .CSV (for many contacts). Contact fields include First name, Last name, Email, Publication, Role, and Industry. Import automatically triggers CREATE, STORE EMBEDDING.

### 2. CONTACT LIST [[CRUD Operations]]

> (`contacts/views.py/ContactListsView`, `ContactListDetailView`, and `AddContactsToListView`)

Contacts and contact lists have a many-to-many relationship. If the list does not exist, a new list is created.

### 3. CREATE, STORE CONTACT [[Vector Embeddings]]

> (`contacts/models.py/create_and_index_embedding()`)

A receiver function calls the following:

#### EMBEDDING CREATION

> (`vectors/services.py/EmbeddingService.py/generate_embeddings()`)

Generate a contact embedding with text-embedding-ada-002. The embedding model is fed the combined text of the "role" + "industry" fields for each contact.

#### EMBEDDING STORAGE

> (`vectors/services.py/IndexingService.py/populate_index()`)

Upsert() contact embeddings into an index, the highest-level organizational unit of vector data in [[Pinecone]]. A new index is created if one does not already exist.

### 4. VECTOR OPERATIONS

Functions / documentation should be refactored to reflect this language.

#### [DescribeIndexStats](https://docs.pinecone.io/reference/describe_index_stats_post)

Returns info about the index (vector count per namespace, number of dimensions).

#### [Query](https://docs.pinecone.io/reference/query)

> (`vectors/services.py/SearchService.py/search_index()` and `vectors/views.py/SearchView`)

Searches a namespace using a query vector; retrieves the IDs of the most similar items along with their similarity scores.

#### [Delete](https://docs.pinecone.io/reference/delete_post)

Deletes vectors, by ID, from a single namespace.

#### [Fetch](https://docs.pinecone.io/reference/fetch)

Returns vectors, by ID, from a namespace; the returned vectors include the vector data and/or metadata.

#### [Update](https://docs.pinecone.io/reference/update)

Updates a vector's values and/or metadata in a namespace.

#### [Upsert](https://docs.pinecone.io/reference/upsert)

Writes vectors into a namespace.

### 5. INDEX OPERATIONS

#### [list_collections](https://docs.pinecone.io/reference/list_collections)

Returns a list of your Pinecone collections. **Not supported by projects on the gcp-starter environment.**

#### [create_collection](https://docs.pinecone.io/reference/create_collection)

Creates a Pinecone collection. **Not supported by projects on the gcp-starter environment.**

#### [describe_collection](https://docs.pinecone.io/reference/describe_collection)

Gets a description of a collection. **Not supported by projects on the gcp-starter environment.**

#### [delete_collection](https://docs.pinecone.io/reference/delete_collection)

Deletes an existing collection. **Not supported by projects on the gcp-starter environment.**

#### [list_indexes](https://docs.pinecone.io/reference/list_indexes)

Returns a list of your Pinecone indexes.

#### [create_index](https://docs.pinecone.io/reference/create_index)

Creates a Pinecone index.

#### [describe_index](https://docs.pinecone.io/reference/describe_index)

Gets a description of an index.

#### [delete_index](https://docs.pinecone.io/reference/delete_index)

Deletes an existing index.

#### [configure_index](https://docs.pinecone.io/reference/configure_index)

Specifies the pod type and number of replicas for an index. **Not supported by projects on the gcp-starter environment.**

### 6. [[EMBEDDING PROJECTION]]

Generate 2D and 3D projections of the entire dataset (or a subset upon query) using TensorFlow's TensorBoard projector.

- Perhaps the plot could be color coded by % match: green to red, with sharp gradient changes every 10% [[Pasted image 20230930144822.png]]
- Or, if we change to uploading metadata by industry, color code that, too

### 7. DATA EXPORT

Export .CSV contact lists (after query) and .PDF projections.

### 8. [[Retrieval Augmented Generation (RAG)]]

## NOTES

- [[Design, Develop, & Deploy a Python Application]]
- [[LLMs and Embeddings]]
- [[Integrating Django & React.js]]
- [[Django]]
- [[Django Rest Framework (DRF)]]
- [[Django's Object-Relational Mapping (ORM)]]
- [[Django Models]]
- https://docs.pinecone.io/docs/openai
- https://platform.openai.com/docs/quickstart/build-your-application
- [[OP Stack]]
- [[Common Uses of Vector Embeddings]]

## PROGRESS

[[project prompt]]

### Bugs:

- [[same contact two different IDs]]
- `create_serverless_index()` logic is broken if the index already exists, probably because `pc.list_indexes()` is not up to spec
    - services.py line 250 (this may be fixed and I forgot to update; I migrated to serverless)

### New feature(s):

#### General:

- Implement a config file, then TruLens or some other feedback functions [[Building Production-Grade LLM or RAG Apps]]
- Integrate with a bigger contact DB API; also figure out how to find recent work / posts. Is RSS enough?

#### Contacts:

- Implement filtering and searching contacts and contact lists on the backend views.
- Create a new API endpoint or modify an existing one to accept PostgreSQL searches (in contacts/services.py and views.py)
- Create a new API endpoint or modify an existing one (like api/contact-lists/) to accept a contact ID and return all lists that include this contact. Alternatively, add a field in the contact model that references the lists the contact is a part of.
- Consider adding fields like created_at and updated_at to both the contact and contact list models for better record-keeping and management.
- If there's a need to store additional metadata about the relationship (e.g., the date when a contact was added to a list), consider using a through model for the ManyToManyField.
- To provide more context in API responses, consider nested serializers. For example, when retrieving a `Contact`, include the names of the lists they are part of.

#### Functions:

- Clustering (where text strings are grouped by similarity)
- Recommendations (where items with related text strings are recommended)
- [[Common Uses of Vector Embeddings]]
- [[Support User Crud Operations]]
- Add logic for delete_embedding_vectors: to delete all vectors in an index, delete the index instead of batching
    - I can't find a way to get VECTOR_COUNT
- Add a data freshness check to query to ensure Pinecone data is consistent (only necessary RIGHT after an upload, though)

#### Optimizing:

- [[Use transaction.atomic() to roll back changes made in DB after an error]]
- [[Refactor VectorOperations to handle cases where vectors do not exist]]
- [[Refactor logger methods to have levels]]
- [[Implement error handling]]
- [[Refactor code to use <with> when initializing Pinecone]]

#### Back-burner:

**The principles of premature optimization:** Don't optimize further unless there's a clear need to. Begin with the simplest solution that works, then profile and optimize based on the application's real-world performance and requirements.

- [[Refactor Upsert and Query to use metadata (paid)]]
- [[Dense Vectors & Sparse Vectors]]
- [[Refactor ID Generation]]
- [Retrying with exponential backoff](https://platform.openai.com/docs/guides/rate-limits/retrying-with-exponential-backoff)
- [Send upserts in parallel](https://docs.pinecone.io/docs/insert-data#sending-upserts-in-parallel)

#### Pivot:

- https://developers.facebook.com/docs/instagram-api/reference/ig-user/insights/
- RAG for pitch generation (presupposing a connection to an API for recently written articles)

## UNKNOWNS

- Am I indexing dense or sparse vectors, and how will that affect search?
- [[understanding top k]]
- [[Through Model in Django]]
- [[select_related and prefetch_related in Django]]
- [[Implementing Filtering and Searching on Backend vs Frontend]]
- [[integrate with larger contact data apis]]
- [[Asynchronous Operations]]: Right now, importing a CSV takes about 45-60 seconds to complete, and there's no HTTP response until then.
- [[prompt - asynchronous operations]]
- [[UI]]
- React vs Svelte?
- Understanding feedback functions
- [Should I use one of the newer embedding models](https://openai.com/blog/new-embedding-models-and-api-updates)?

## LINKS

- [ChatGPT clone](https://chat.patrikzudel.me/)
- [My dev server](http://127.0.0.1:8000/)
- [My Github commits](https://github.com/ethanpy3/analyzeContacts/commits/master)
- [OpenAI docs](https://platform.openai.com/docs/overview)
- [OpenAI API reference](https://platform.openai.com/docs/api-reference)
- [My Pinecone indexes](https://app.pinecone.io/organizations/-NemjuoGX_l3gBGnd7rC/projects/gcp-starter:0827f1a/indexes?sessionType=login)
- [Pinecone docs](https://docs.pinecone.io/docs/overview)
- [Pinecone API reference](https://docs.pinecone.io/reference/describe_index_stats_post)
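
## SKETCH: QUERY RANKING

To make the Query operation and the "understanding top k" question more concrete, here is a minimal, dependency-free sketch of what a similarity query does conceptually: score every stored vector against a query vector and keep the `k` best matches. It is not the project's `search_index()` and does not call Pinecone or OpenAI. The helper names (`embedding_text`, `cosine_similarity`, `top_k`), the single-space separator between "role" and "industry", the cosine metric, and the 3-dimensional toy vectors are all assumptions for illustration; real text-embedding-ada-002 vectors have 1536 dimensions.

```python
import math


def embedding_text(contact: dict) -> str:
    """Build the text to embed from a contact (assumed: role and industry joined by a space)."""
    return f"{contact['role']} {contact['industry']}"


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float], index: dict[str, list[float]], k: int = 3):
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(vid, cosine_similarity(query_vector, vec)) for vid, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-in for a vector index: contact id -> embedding (hypothetical values).
toy_index = {
    "contact-1": [0.9, 0.1, 0.0],
    "contact-2": [0.1, 0.9, 0.0],
    "contact-3": [0.7, 0.3, 0.1],
}

results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
for contact_id, score in results:
    print(contact_id, round(score, 3))
```

A larger `k` returns more, progressively less similar, contacts; a hosted vector database performs the same ranking with approximate nearest-neighbor search rather than a full scan.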
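
## SKETCH: RETRY WITH BACKOFF

The back-burner item on retrying with exponential backoff could look roughly like the following. This is a generic sketch, not code from the OpenAI guide or from this project: `retry_with_backoff`, its parameters, and the `flaky` stand-in function are hypothetical, and a real version would catch only the rate-limit or transient errors raised by whichever client is in use.

```python
import random
import time


def retry_with_backoff(
    func,
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: tuple = (Exception,),
    sleep=time.sleep,
):
    """Call func(); on failure wait base_delay * 2**attempt (capped, with jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Small random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


# Hypothetical stand-in for an API call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(retry_with_backoff(flaky, base_delay=0.01, retry_on=(ConnectionError,)))
```

Passing `sleep` as a parameter keeps the helper easy to test, since a test can record the delays instead of actually waiting.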