# The GDELT Project

https://blog.gdeltproject.org/the-datasets-of-gdelt-as-of-february-2016/
https://gdelt.github.io/#api=doc&query=&contentmode=ArtList&maxrecords=75&timespan=1d
https://github.com/gdelt/gdelt.github.io?tab=readme-ov-file
https://github.com/geotheory/gdelt-shiny

Apparently there is a 30-minute rolling window of articles embedded for semantic search.

## A Global Database of Society

Supported by [Google Jigsaw](https://jigsaw.google.com/), the GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.

## Databases / Tables / Endpoints

GDELT 2.0:

- [[GDELT Global Knowledge Graph (GKG) API]] (graph SVOs, no API)
	- might be accessible as the `gkg` table in the `gdelt` Python package, e.g. `results = gd2.Search(date_range, table='gkg', output='pd')`
- [[GDELT Event Database]] (returns worldwide events)
- [[GDELT Mentions]] (returns articles mentioning events)
- [[GDELT Doc API]] (searches article bodies and returns articles)
- [[GDELT Context API]] (searches article bodies and returns articles + quotes)
- [[The GDELT GEO API]] (searches article bodies and returns articles + locations, on a map)
- [[GDELT TV API]] (searches keywords for TV statistics)

# Constructing a Narrative with GDELT

- **Real-time Monitoring:** First, we'll use the **GDELT Context 2.0 API** to find articles relevant to the client's interests. This endpoint uniquely returns one sentence matching the query and another sentence adding context; we embed both and compare them to the client corpus via cosine similarity, keeping only the highest-scoring articles.
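A minimal sketch of the monitoring call above. The Context 2.0 endpoint is `api/v2/context/context`; the `mode`, `maxrecords`, `timespan`, and `format` parameters follow the DOC 2.0 conventions and should be verified against the Context API docs before relying on them:

```python
from urllib.parse import urlencode

# Assumed Context 2.0 endpoint; parameter names mirror the DOC 2.0 API.
BASE = "https://api.gdeltproject.org/api/v2/context/context"

def context_url(query, timespan="1d", maxrecords=75, fmt="json"):
    """Build a Context API request URL for the given keyword query."""
    params = {
        "query": query,
        "mode": "artlist",       # article list, as in the gdelt.github.io explorer
        "maxrecords": maxrecords,
        "timespan": timespan,
        "format": fmt,
    }
    return f"{BASE}?{urlencode(params)}"

url = context_url('"supply chain" sourcecountry:US')
```

The returned JSON can then be parsed for the matched and context sentences before the embedding step.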
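The cosine-similarity filter itself is straightforward; a toy sketch with invented vectors (real ones would come from an embedding model such as text-embedding-3-small):

```python
import numpy as np

def cosine_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Invented embeddings: 3 article sentences vs. a 1-document client corpus.
article_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
client_vecs = np.array([[1.0, 0.1]])

# Score each article by its best match against the corpus, keep the top 2.
scores = cosine_matrix(article_vecs, client_vecs).max(axis=1)
top = np.argsort(scores)[::-1][:2]
```

Taking the max over the corpus axis means an article only needs to match one client document well to survive the filter.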
%% - **Enriching Real-time Monitoring with Historical Data:** Use the **DOC API** to search for articles related to the client's interests dating back to 2017. **Challenge**: the DOC API does not provide sentence-level matches. For a subset of highly relevant articles (based on metadata), retrieve the full text and extract sentences or summaries for embedding and similarity comparison. This may not be the right API endpoint, and real-time monitoring via the Context API will likely be more comprehensive in coverage because of `maxrecords`. %%
- **Linking Articles to Events:** Use the article URLs or titles from both APIs to query the **Mentions Table** and obtain the `GlobalEventID`s. This links both recent and historical articles to their corresponding events.
- **Retrieving Event Details:** With the `GlobalEventID`s from both recent and historical data, retrieve detailed event information from the **Event Table**.
- **Building a Narrative:** Combine events from the Context API (recent) and the DOC API (historical) to build a narrative showing the evolution of topics over time. Use LLMs to identify patterns, trends, or recurring themes significant to the client's interests.

https://www.gdeltproject.org/data.html#documentation

Or use the summary endpoint (`https://api.gdeltproject.org/api/v2/summary/summary`) for an overview of a few of the APIs:

```
https://api.gdeltproject.org/api/v2/summary/summary?d=web&t=summary&k=Donald+Trump&ts=full&fsc=US&fsl=eng&svt=zoom&stc=yes&sta=list&c=1
```

## Benchmarks and Costs

On 6 cores, 32 GB RAM:

| History | Links | Retrieved + filtered in | Outlets |
| --- | --- | --- | --- |
| 1 month | 100K | 3 s | 3K |
| 1 year | 1.5M | 24 s | 8K |
| 5 years | 5M | 90 s | 8K |
| 9 years (max) | 17M | 275 s | 28K |

If text-embedding-3-small costs $0.020 / 1M tokens, embedding 5 million headlines at an average of 11 tokens each costs about $1.10.

If each link is four chunks of 256 tokens and gpt-4o-mini is $0.150 / 1M input tokens, then chunking, embedding, and summarizing would cost ~$15.
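The linking and retrieval steps above amount to a two-stage join; a toy pandas sketch with invented rows (the column names follow the GDELT 2.0 codebooks, where `MentionIdentifier` holds the article URL, but verify against the codebook for the fields you need):

```python
import pandas as pd

# Invented Mentions-table slice: each row ties an article URL to an event.
mentions = pd.DataFrame({
    "GlobalEventID": [101, 101, 202],
    "MentionIdentifier": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ],
})

# Invented Event-table slice keyed by GlobalEventID.
events = pd.DataFrame({
    "GlobalEventID": [101, 202],
    "EventCode": ["042", "190"],
    "AvgTone": [2.1, -5.3],
})

# URLs that survived the Context/DOC similarity filter.
matched_urls = ["https://example.com/a", "https://example.com/c"]

# Step 1: Mentions lookup maps URLs to GlobalEventIDs.
hits = mentions[mentions["MentionIdentifier"].isin(matched_urls)]
# Step 2: join against the Event table for event details.
detail = hits.merge(events, on="GlobalEventID")
```

The same join works whether the URLs came from the real-time Context API or the historical DOC API, which is what lets the two streams share one event timeline.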
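The cost figures check out as back-of-envelope arithmetic. The ~$15 number is consistent with the 1-month / 100K-link set; that mapping is an assumption, since the note does not say which link count it refers to:

```python
# Prices per token, from the notes above.
EMBED_PRICE = 0.020 / 1_000_000  # text-embedding-3-small, $ / token
MINI_PRICE = 0.150 / 1_000_000   # gpt-4o-mini input, $ / token

# 5M headlines at ~11 tokens each through the embedding model.
headline_cost = 5_000_000 * 11 * EMBED_PRICE

# Assumption: the ~$15 is for the 1-month set of 100K links,
# each split into 4 chunks of 256 tokens fed to gpt-4o-mini.
chunk_cost = 100_000 * 4 * 256 * MINI_PRICE

print(round(headline_cost, 2), round(chunk_cost, 2))  # 1.1 15.36
```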