Here is the workflow for the analysis:
1. Starting with the Data
The foundation of the project was a weekly RSS feed comprising approximately 55 entries in HTML format. Each entry provided metadata (title, link, description) without the full article content.
2. Streamlining Data Retrieval
To process this data efficiently, I used the CMS's RSS feed to export the entries into a single XML file. A script then converted that file into a structured pandas DataFrame, which made the analysis easier.
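As a rough illustration of the XML-to-DataFrame step, here is a minimal sketch. It assumes a standard RSS 2.0 layout with `<item>` elements carrying `title`, `link`, `description`, and `pubDate`. The sample feed, the function name `rss_to_dataframe`, and the column names are hypothetical and are not taken from the project's script.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

import pandas as pd

# Tiny inline sample standing in for the exported feed file (hypothetical content).
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Weekly news</title>
    <item>
      <title>Example headline A</title>
      <link>https://example.com/a</link>
      <description>&lt;p&gt;Summary of A&lt;/p&gt;</description>
      <pubDate>Mon, 01 Jan 2024 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Example headline B</title>
      <link>https://example.com/b</link>
      <description>Summary of B</description>
      <pubDate>Mon, 08 Jan 2024 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss_date(text):
    """Parse an RFC 822 style date; return None when it is missing or malformed."""
    try:
        return parsedate_to_datetime(text)
    except (TypeError, ValueError):
        return None


def rss_to_dataframe(xml_text: str) -> pd.DataFrame:
    """Turn each <item> of an RSS 2.0 document into one DataFrame row."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        rows.append(
            {
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
                "description": item.findtext("description", default=""),
                "published": parse_rss_date(item.findtext("pubDate", default="")),
            }
        )
    return pd.DataFrame(rows, columns=["title", "link", "description", "published"])


feed_df = rss_to_dataframe(SAMPLE_RSS)
```

For a real export, the file contents would be read from disk and passed to the same function. The descriptions still contain HTML at this point, so a cleaning pass would likely follow.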
3. Enriching the Dataset
To determine which news items were impactful, I integrated market data (e.g., weekly S&P 500 returns, individual stock movements) via Yahoo Finance. Key metrics included:
- Weekly returns for individual stocks.
- Market-wide daily and weekly returns.
- Growth above market performance.
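The metrics above can be sketched with pandas. This example uses made-up closing prices instead of a live Yahoo Finance download so that it runs offline. The tickers `STOCK` and `MARKET` are placeholders. It treats "growth above market" as the stock's weekly return minus the market's weekly return, which is one plausible definition and an assumption here. It also assumes weeks end on Friday.

```python
import pandas as pd

# Hypothetical daily closing prices over two business weeks (made-up numbers).
dates = pd.bdate_range("2024-01-01", periods=10)
prices = pd.DataFrame(
    {
        "STOCK": [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 110.0],
        "MARKET": [200.0, 200.5, 201.0, 201.5, 202.0, 202.5, 203.0, 203.5, 204.0, 204.0],
    },
    index=dates,
)

# Daily returns: percentage change from the previous trading day's close.
daily_returns = prices.pct_change()

# Weekly returns: take the last close of each week ending on Friday,
# then compute the percentage change from one week's close to the next.
weekly_close = prices.resample("W-FRI").last()
weekly_returns = weekly_close.pct_change()

# Growth above market (assumed definition): stock return minus market return.
growth_above_market = weekly_returns["STOCK"] - weekly_returns["MARKET"]
```

The first weekly row is `NaN` because there is no earlier week to compare against. With real data, `prices` would be replaced by closing prices fetched from Yahoo Finance, and the rest of the calculation would stay the same.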
4. Building the LangChain Index
Using LangChain, I created an index to enable semantic queries and contextual analysis of the dataset. This integration allowed for advanced retrieval and summarization using LLMs.
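To make the idea of an index concrete, here is a toy stand-in that does not use LangChain at all. It ranks headlines by word-overlap cosine similarity, whereas a real semantic index would compare embedding vectors. The class `ToyIndex`, its `search` method, and the sample headlines are all hypothetical and only illustrate the retrieve-then-summarize pattern.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    """Keeps one count vector per document and ranks documents against a query."""

    def __init__(self, docs):
        self.docs = list(docs)
        self.vectors = [Counter(tokenize(d)) for d in self.docs]

    def search(self, query: str, k: int = 2) -> list:
        q = Counter(tokenize(query))
        scored = sorted(
            ((cosine(q, v), d) for v, d in zip(self.vectors, self.docs)),
            reverse=True,
        )
        # Drop documents that share no words with the query.
        return [d for score, d in scored[:k] if score > 0]


# Made-up headlines, for illustration only.
headlines = [
    "Chipmaker beats earnings expectations on AI demand",
    "Regulator opens probe into payment processor",
    "Retailer issues weak holiday guidance",
]
index = ToyIndex(headlines)
hits = index.search("AI earnings surprise", k=1)
```

In a full pipeline, the retrieved items would be passed to an LLM as context for summarization. Replacing the count vectors with embeddings is what lets queries match on meaning instead of exact wording.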
5. Insights and Trends Analysis
By querying the enriched dataset, I extracted insights into stock trends, significant headlines, and recurring themes. For example, I identified key drivers of stock price changes, such as earnings surprises, regulatory challenges, and AI advancements.
For more technical detail and a full description of each step, please see the README in the repository:
https://github.com/realmistic/long-term-news-llm-rag/blob/main/README.md