#Part 16

Long-term Financial News with LLMs

Many of you may have noticed the rapid explosion of baseline large language models (LLMs) in 2023 and 2024, alongside an arms race among large companies to secure NVIDIA GPUs. Fast forward to early 2025, and while the pace of baseline model development has slowed, the focus has shifted. Companies are now moving further down the funnel—deploying these models on the cloud, selling subscriptions, and creating new services such as LLM-powered agents.

As an independent developer navigating this abundance of technology, I set out to find a meaningful use case for LLMs to build something truly useful. With a rich database of weekly financial news (published as 55 separate pages on a website), I decided to leverage LLMs for web scraping and retrieval-augmented generation (RAG) to make sense of it all.
Author: Ivan Brigida
Data Analyst and Financial Enthusiast
Contributor: Prabin Kumar Nayak
Data Freelancer
NOT INVESTMENT ADVICE
The Content is for informational purposes only; you should not construe any such information or other material as legal, tax, investment, financial, or other advice.

Intro

With financial markets flooded by thousands of news articles every week, how can investors focus on what truly matters? This question inspired me to leverage large language models (LLMs) to analyze vast datasets of financial news and uncover actionable insights.
Over a year ago, I introduced a weekly update section on our website that summarizes over 5,000 financial news articles using AI.

The initiative, while valuable, raised important questions:
  • Which news truly matters?
  • How can we identify longer-term trends beyond weekly cycles?
  • Can we overcome coverage bias that skews headlines toward large companies?
  • How do we define the significance of financial news?
This article outlines how I tackled these challenges by incorporating market growth statistics and deploying advanced techniques such as Retrieval-Augmented Generation (RAG).
Executive Summary
The project focuses on identifying long-term stock trends by analysing weekly financial news for key companies across various industries. It highlights how core business performance, market expectations, and external challenges influence stock movements over time.

Earnings reports remain a primary driver of sharp price fluctuations, particularly when results deviate significantly from forecasts. Investors react strongly to signals of business health, such as new contracts, market expansion, or exceptional revenue and profitability results. Stock splits continue to generate positive momentum, often signalling management confidence and improving accessibility for investors.

On the other hand, regulatory challenges and legal risks negatively affect market sentiment. These uncertainties can overshadow otherwise stable performance, driving significant declines in valuation. The analysis also underscores the importance of clear business strategies and alignment with investor expectations; companies that fail to demonstrate resilience or growth potential often experience market underperformance.

The pipeline supporting this analysis integrates financial data with advanced processing to uncover nuanced trends. By focusing on business fundamentals and market dynamics, the project offers valuable insights into the drivers of stock performance, equipping stakeholders with actionable intelligence for decision-making.

Summary of Results

  • Quarterly Reporting
    Earnings reports were the primary drivers of significant price moves, especially when results deviated from expectations.
  • AI Expectations
    AI remained a dominant theme, driving optimism for companies like Nvidia (+11.43%) and Palantir (+21.43%, +40.42%). However, risks such as competition affected others, like Alphabet (-5.63%).
  • Stock Splits
    Stock splits created positive momentum, signaling management confidence and improving share accessibility (e.g., Nvidia on June 10th 2024).
  • Regulatory Challenges
    Legal issues or regulatory scrutiny had a significant negative impact (e.g., Acadia Healthcare: -17.79%).
  • Core Business Performance
    Strong fundamentals drove growth for companies like Amazon (+5.41%), while weak guidance caused declines (e.g., Tesla: -7.95%).

The Weekly News Database

Original Data Source:
The data comes from the Polygon.AI News API, which provides a large volume of individual news items. Each entry includes only metadata and a headline. These raw news items are then summarized weekly. I’ve detailed the summarization process using OpenAI’s ChatGPT in this article: Leveraging OpenAI's API for Financial News Summarization.

Example Overview of a One-Week Publication:
You can view examples from any week here: Weekly Financial News Feed.

Here’s how the weekly summaries are structured:
Part 1: Individual Ticker Coverage
This section highlights coverage for approximately 10 individual tickers per week. Frequently mentioned companies—such as NVDA, AMZN, TSLA, and AAPL—appear almost every week, while others are included only during weeks when significant news arises.
Each news item summarized in this section is associated with a single ticker, as determined by the Polygon.AI classifier, which ensures the article focuses exclusively on one company.
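
As a rough illustration of the single-ticker rule, the sketch below keeps only items tagged with exactly one ticker and buckets their headlines by ISO week, which is the shape a weekly summarization step would need. The field names (`tickers`, `published_utc`, `title`), the helper name, and the sample items are assumptions made for this example, not the actual schema or code used in the project.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical news items; field names and headlines are illustrative only.
news_items = [
    {"title": "Chipmaker announces stock split", "tickers": ["NVDA"],
     "published_utc": "2024-06-10T13:00:00Z"},
    {"title": "Big Tech earnings preview", "tickers": ["AMZN", "AAPL"],
     "published_utc": "2024-06-11T09:30:00Z"},
    {"title": "EV maker updates guidance", "tickers": ["TSLA"],
     "published_utc": "2024-06-12T15:45:00Z"},
    {"title": "Chipmaker shares extend gains", "tickers": ["NVDA"],
     "published_utc": "2024-06-18T10:00:00Z"},
]


def group_single_ticker_news(items):
    """Group headlines by (ticker, ISO year, ISO week), skipping multi-ticker items."""
    groups = defaultdict(list)
    for item in items:
        if len(item["tickers"]) != 1:
            continue  # the article is not exclusively about one company
        published = datetime.strptime(item["published_utc"], "%Y-%m-%dT%H:%M:%SZ")
        iso_year, iso_week, _ = published.isocalendar()
        groups[(item["tickers"][0], iso_year, iso_week)].append(item["title"])
    return dict(groups)


weekly_groups = group_single_ticker_news(news_items)
for key, titles in sorted(weekly_groups.items()):
    print(key, titles)
```

Each resulting group is what would be handed to the LLM for a per-ticker weekly summary; the multi-ticker item is dropped rather than duplicated across companies.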

Part 2: Broader Market Summaries
This section includes:
  • A 1-day summary of market news, typically capturing a single full trading day (often Monday).
  • A 1-week summary, which occasionally extends to cover up to a month.
These summaries incorporate a much larger volume of news—around 2,000 articles per week—categorized by verticals, investment trends, or groups of companies. This section is particularly useful for uncovering unconventional updates and gaining insights into smaller companies that, while not always in the spotlight, remain significant players in their industries.

Potential Biases

The results should be taken with a grain of salt.

I don't recommend treating this information as the ultimate source of truth for real-time trading decisions. However, I believe it generally aligns with market logic and reflects the fundamental principles of the financial ecosystem.

Here is a list of potential biases that come to mind:
  1. Limited Original Data: The data provided by the Polygon News API includes only the title and metadata, without the full content of the articles.
  2. Classifier Accuracy: The system uses classifiers to determine which company is mentioned in the news. These classifications may be incorrect or include multiple companies per article.
  3. Language and Market Scope: Our analysis is mostly limited to English-language sources focused on the U.S. market, rather than global coverage.
  4. Coverage Bias: Popular companies often have 10–100 articles per week. These articles are summarized by LLMs (primarily GPT-4) into just a few sentences, so the quality and accuracy of the summaries are not guaranteed.
  5. Inconsistent Timeframes: Some weeks may have missing data, and individual articles occasionally fall outside the intended 7-day window, covering up to 10–14 days instead.
  6. Supplementary Data: We rely on growth-above-market statistics from Yahoo Finance to guide attention toward the most "impactful" articles.
  7. Indexing Limitations: LangChain relies on specific fields, such as "TICKER" and "DATE PERIOD," for indexing news. This requires the LLM to correctly identify and focus on relevant fields in the query to return the right results.
  8. RAG Filtering Constraints: Retrieval-Augmented Generation (RAG) limits the number of news articles returned during filtering to a maximum of seven (usually providing exactly seven results, when possible).
  9. Double Data Processing: The seven filtered results are further aggregated and summarized into long-term outcomes, introducing an additional layer of data processing that may compound inaccuracies.
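
To make biases 7 and 8 more concrete, here is a minimal sketch of field-based filtering with a cap of seven results. It is plain Python rather than LangChain, and the record layout, the `retrieve` helper, the newest-first ordering, and the sample data are all hypothetical; in the real pipeline the LLM and the index are responsible for matching the fields.

```python
from datetime import date, timedelta

MAX_RESULTS = 7  # the cap described in item 8 above


def retrieve(records, ticker, start, end, k=MAX_RESULTS):
    """Return at most k records for `ticker` with week_start in [start, end], newest first."""
    matches = [
        r for r in records
        if r["ticker"] == ticker and start <= r["week_start"] <= end
    ]
    matches.sort(key=lambda r: r["week_start"], reverse=True)
    return matches[:k]


# Hypothetical weekly summaries: ten weeks for one ticker, one week for another.
first_week = date(2024, 1, 1)
records = [
    {"ticker": "NVDA", "week_start": first_week + timedelta(weeks=i),
     "summary": f"NVDA summary {i}"}
    for i in range(10)
]
records.append({"ticker": "TSLA", "week_start": first_week, "summary": "TSLA summary 0"})

results = retrieve(records, "NVDA", date(2024, 1, 1), date(2024, 12, 31))
print(len(results), [r["summary"] for r in results])
```

Ten weeks match the query, but only seven are returned. Whatever ordering the retriever uses, the remaining matches never reach the summarization step, which is why the cap is listed as a source of bias.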

The Approach


Here is the workflow for the analysis:
1. Starting with the Data
The foundation of the project was a weekly RSS feed comprising approximately 55 entries in HTML format. Each entry provided metadata (titles, links, descriptions) without full article content.
2. Streamlining Data Retrieval
To process this data efficiently, I used the CMS's RSS feed to extract entries into a single XML file. A script converted the data into a structured Pandas DataFrame, enabling easier analysis.
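
A minimal sketch of this step, using only the standard library, might look like the following. The sample feed, the `example.com` links, and the `rss_to_records` helper are placeholders invented for illustration; the actual script lives in the repository and may differ. It also assumes every item carries a `pubDate` element.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Tiny stand-in for the exported feed; the real file has about 55 entries.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Weekly Financial News Feed</title>
    <item>
      <title>Week 1 summary</title>
      <link>https://example.com/week-1</link>
      <description>&lt;p&gt;First weekly summary&lt;/p&gt;</description>
      <pubDate>Mon, 10 Jun 2024 08:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Week 2 summary</title>
      <link>https://example.com/week-2</link>
      <description>&lt;p&gt;Second weekly summary&lt;/p&gt;</description>
      <pubDate>Mon, 17 Jun 2024 08:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""


def rss_to_records(xml_text):
    """Parse RSS <item> elements into a list of flat dictionaries."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        records.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            # The description holds escaped HTML, which is unescaped by the parser.
            "description": item.findtext("description", default="").strip(),
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        })
    return records


records = rss_to_records(SAMPLE_RSS)
for record in records:
    print(record["published"].date(), record["title"])
```

With pandas installed, `pandas.DataFrame(records)` turns these rows into the DataFrame described above.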
3. Enriching the Dataset
To determine which news items were impactful, I integrated market data (e.g., weekly S&P 500 returns, individual stock movements) via Yahoo Finance. Key metrics included:
  • Weekly returns for individual stocks.
  • Market-wide daily and weekly returns.
  • Growth above market performance.
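
The metrics above can be sketched in a few lines. This example assumes that "growth above market" means the stock's weekly return minus the market's weekly return, which is one common definition and not necessarily the exact formula used in the project. The prices and the 5% cutoff are hypothetical.

```python
IMPACT_THRESHOLD = 0.05  # hypothetical cutoff: 5 percentage points away from the market


def weekly_return(start_close, end_close):
    """Simple return over the week, as a fraction (0.05 means +5%)."""
    return end_close / start_close - 1.0


def growth_above_market(stock_return, market_return):
    """Excess return: stock return minus market return (assumed definition)."""
    return stock_return - market_return


def is_impactful(excess_return, threshold=IMPACT_THRESHOLD):
    """Flag weeks where the stock moved far from the market in either direction."""
    return abs(excess_return) >= threshold


# Hypothetical closing prices for one stock and a market index.
stock_ret = weekly_return(100.0, 112.0)
market_ret = weekly_return(5000.0, 5050.0)
excess = growth_above_market(stock_ret, market_ret)

print(f"stock {stock_ret:+.2%}, market {market_ret:+.2%}, above market {excess:+.2%}")
print("impactful week:", is_impactful(excess))
```

Subtracting the market return keeps a broad rally or sell-off from making every ticker look newsworthy in the same week.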
4. Building the LangChain Index
Using LangChain, I created an index to enable semantic queries and contextual analysis of the dataset. This integration allowed for advanced retrieval and summarization using LLMs.
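
Before indexing, each enriched weekly record has to be turned into a piece of text plus metadata. The sketch below shows one possible layout in plain Python. The "TICKER" and "DATE PERIOD" labels follow the fields mentioned in the biases list, while the "GROWTH ABOVE MARKET" and "NEWS" labels, the `to_document` helper, and the sample row are assumptions for illustration.

```python
def to_document(row):
    """Render one enriched weekly record as indexable text plus a metadata dict."""
    text = (
        f"TICKER: {row['ticker']}\n"
        f"DATE PERIOD: {row['start']} to {row['end']}\n"
        f"GROWTH ABOVE MARKET: {row['growth_above_market']:+.2%}\n"
        f"NEWS: {row['summary']}"
    )
    metadata = {"ticker": row["ticker"], "start": row["start"], "end": row["end"]}
    # This pair maps onto a LangChain Document(page_content=..., metadata=...).
    return {"page_content": text, "metadata": metadata}


# Hypothetical enriched record; ticker, dates, and text are placeholders.
row = {
    "ticker": "ABC",
    "start": "2024-06-10",
    "end": "2024-06-16",
    "growth_above_market": 0.11,
    "summary": "Shares rose after quarterly results beat expectations.",
}

doc = to_document(row)
print(doc["page_content"])
```

Putting the key fields into the text itself, and not only into the metadata, gives a semantic query such as "NVDA news in June 2024" something explicit to match against.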
5. Insights and Trends Analysis
By querying the enriched dataset, I extracted insights into stock trends, significant headlines, and recurring themes. For example, I identified key drivers of stock price changes, such as earnings surprises, regulatory challenges, and AI advancements.

For more technical detail and a full description of each step, see the README in the repository: https://github.com/realmistic/long-term-news-llm-rag/blob/main/README.md

Long Term Business Insights

Number of Companies Covered
The dataset captures 26 companies with significant weekly stock moves across various industries, ranging from tech giants to retail and specialty firms.

Key Trends and Tags
The main reasons behind the stock moves are categorized into the following tags (with examples):
Quarterly Reporting (7 mentions):
  • Positive: Target (+22.08%), MU (+12.89%), Adobe (+10.79%)
  • Negative: Salesforce (-13.10%), Rivian (-34.36%)
  • Reporting performance consistently caused significant volatility, as earnings and forecasts heavily influence investor sentiment.

AI Expectations and Developments (8 mentions):
  • Positive: NVDA (+11.43%), AMD (+5.76%), Palantir (+21.43%, +40.42%)
  • Negative: Alphabet (-5.63%)
  • AI was a major driver of optimism, particularly for companies like NVDA, PLTR, and AMD, but Alphabet faced risks related to AI competition.

Stock Splits (3 mentions):
  • Positive: Nvidia (+12.13%), Broadcom (+23.37%), SMCI (+9.57%)
  • Stock splits generally brought positive momentum, signaling confidence from the companies while improving accessibility for investors.

Regulatory Challenges and Legal Issues (8 mentions):
  • Negative: Alphabet (-4.17%), CRWD (-11.07%), ACHC (-17.79%), CELH (-13.54%)
  • Companies facing lawsuits or regulatory scrutiny experienced sharp declines due to increased uncertainty and risks.

Core Business Performance (10 mentions):
  • Positive: Amazon (+5.41%), Verizon (+6.50%), ASML (+7.62%)
  • Negative: Tesla (-7.95%), Intel (-9.08%)
  • Companies with steady growth in their main business areas gained, while weaker performance or guidance led to declines.
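
For readers who want to work with these figures, the sketch below tallies the examples listed above by tag. It covers only the moves quoted in this section, so its counts are lower than the mention totals, and the data layout and `summarize_by_tag` helper are illustrative rather than taken from the project code.

```python
from collections import defaultdict

# (tag, company, weekly move in %) for the examples quoted in this section.
EXAMPLES = [
    ("Quarterly Reporting", "Target", 22.08), ("Quarterly Reporting", "MU", 12.89),
    ("Quarterly Reporting", "Adobe", 10.79), ("Quarterly Reporting", "Salesforce", -13.10),
    ("Quarterly Reporting", "Rivian", -34.36),
    ("AI Expectations and Developments", "NVDA", 11.43),
    ("AI Expectations and Developments", "AMD", 5.76),
    ("AI Expectations and Developments", "Palantir", 21.43),
    ("AI Expectations and Developments", "Palantir", 40.42),
    ("AI Expectations and Developments", "Alphabet", -5.63),
    ("Stock Splits", "Nvidia", 12.13), ("Stock Splits", "Broadcom", 23.37),
    ("Stock Splits", "SMCI", 9.57),
    ("Regulatory Challenges and Legal Issues", "Alphabet", -4.17),
    ("Regulatory Challenges and Legal Issues", "CRWD", -11.07),
    ("Regulatory Challenges and Legal Issues", "ACHC", -17.79),
    ("Regulatory Challenges and Legal Issues", "CELH", -13.54),
    ("Core Business Performance", "Amazon", 5.41), ("Core Business Performance", "Verizon", 6.50),
    ("Core Business Performance", "ASML", 7.62), ("Core Business Performance", "Tesla", -7.95),
    ("Core Business Performance", "Intel", -9.08),
]


def summarize_by_tag(examples):
    """Count positive and negative moves per tag and compute the mean move."""
    moves = defaultdict(list)
    for tag, _company, move in examples:
        moves[tag].append(move)
    return {
        tag: {
            "positive": sum(m > 0 for m in values),
            "negative": sum(m < 0 for m in values),
            "mean_move": sum(values) / len(values),
        }
        for tag, values in moves.items()
    }


summary = summarize_by_tag(EXAMPLES)
for tag, stats in summary.items():
    print(f"{tag}: +{stats['positive']} / -{stats['negative']}, mean {stats['mean_move']:+.2f}%")
```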

Biggest Movers
Positive Outliers:
  • Palantir: +21.43% and +40.42% in separate weeks, driven by strong earnings and its AI platform.
Negative Outliers:
  • Rivian: -34.36% due to weak guidance, production issues, and workforce reductions.
  • Acadia Healthcare: -17.79% amid a lawsuit regarding deceptive practices.

Common Factors Across Stocks
  • AI and Innovation: Significant contributors to positive sentiment across various sectors (e.g., NVDA, AAPL, META).
  • Valuation and Guidance Risks: High valuations and poor guidance consistently caused declines (e.g., Tesla, CELH, Rivian).
  • Market Trends: Companies benefiting from broader market shifts, such as AI adoption or new product launches, generally outperformed.
Recurring Themes
Some companies like Nvidia, Broadcom, and Palantir repeatedly surfaced due to their innovation and ability to capitalize on growing sectors like AI. On the flip side, companies like Tesla and Alphabet faced challenges from regulatory pressures and valuation concerns despite their strong presence in the market.

This is the main result of the article. You can check the raw output and all queries to the knowledge database in this interactive notebook: https://github.com/realmistic/long-term-news-llm-rag/blob/main/notebooks/04_RAG_from_content.ipynb
Conclusion
This pipeline transforms unstructured RSS feed data into a rich, queryable dataset, offering valuable insights into long-term financial trends. By integrating advanced LLMs and market context, it provides a scalable framework for analyzing stock performance and identifying key drivers of market dynamics.
As financial markets evolve, this approach highlights the potential of AI to democratize financial analysis and deliver actionable intelligence to investors.
