#Part 3

Sentiment Analysis of Financial News

Can you actually rely on the headlines to make a decision on whether to buy, sell or hold the stock?

Introduction

Is there a relation between all that hype in the media regarding a certain company and how the company is actually doing? Can you actually rely on the headlines to make a decision on whether to buy, sell or hold the stock? In this article, we'll try to answer this question: can the stock market be influenced by the news? In particular, we will try to automatically get the list of news using a news API, apply sentiment analysis, and compare the results with the stock prices. Moreover, we will scale the approach: get daily news about stocks and compare its sentiment versus S&P 500 index performance.

This is the third part in the series on how to take advantage of computer technologies to make informed decisions in stock trading. Part 1 guided you through the process of setting up the working environment needed to follow along with the examples provided in the rest of the series. Then, in part 2, you explored several well-known finance APIs that allow you to obtain and analyse stock data programmatically.
Executive Summary
We searched for relevant financial news (by company name) in business and general-purpose newspapers by calling the News API (through the NewsApiClient library). We took a brief look at the VADER Sentiment Analysis library and the principles described in its underlying paper. Then we looked at the visual co-movement of the S&P500 index and the mean sentiment of all articles about stocks, and saw that business sources tend to be more correlated with the S&P500 than articles from general-purpose newspapers.

The Approach - Sentiment Analysis of News

If we are talking about a well-known company, the media usually cover any significant event involving it (a new contract, a new business line, a strong executive hire, mergers, acquisitions, partnerships, etc.) as well as its financial results (quarterly and annual earnings, profits, earnings per share, etc.). The idea is the following: with the help of a news API, we can automatically retrieve news articles about a certain company that were published within a specified interval, say, on the day before the company's annual general meeting, on the day of the event, and on the next day.

Then, with the help of natural language processing (NLP) techniques, such as sentiment analysis, we can programmatically figure out what emotions prevail (positive, negative or neutral) in those articles. Since sentiment analysis provides a way to represent emotions numerically, you'll be able to compare the overall sentiment for a certain company for a specified period with the stock's price performance.

News coverage is far more than just a source of facts. News can shape our views of many things around us, and finance is no exception. There is a possible connection between a jump in a stock's price and the news: the news can either cause the jump or explain it afterwards. While it is hard to tell which news item in particular had a strong influence on a stock's price, we suggest that there are many "emotional" traders who make judgements based on the polarity of news coverage.

News API

For the purpose of our project in this article, you can use the News API, which lets you get the most relevant news articles about stocks in general or about a certain company in particular. A request to the API is similar to a web search but allows you to narrow down the results by specifying the publication interval for the required articles. The News API is easy to use (via direct HTTP requests or the Python wrapper library), although the free plan limits the number of calls (250 requests every 12 hours) and provides only one month of historical data.

Before you can use News API, you'll need to obtain an API key. This can be done for free at https://newsapi.org/. After that, you can start sending requests to the API. So, before going any further, let's make sure that the API works as expected.

As usual, create a new notebook in Google Colab. If you forgot how to do it, check with part 1. Then, install the newsapi-python Python wrapper for the News API in the notebook.
Py1. Install libs
!pip install newsapi-python
After that, insert and run the following code in a code cell:
Py2. Get the list of relevant news associated with the search phrase
from newsapi import NewsApiClient
from datetime import date, timedelta

phrase = 'Apple stock'
newsapi = NewsApiClient(api_key='your_news_api_key_here')
my_date = date.today() - timedelta(days = 7)
articles = newsapi.get_everything(q=phrase,
                                  from_param = my_date.isoformat(),
                                  language="en",
                                  sort_by="relevancy",
                                  page_size = 5)
for article in articles['articles']:
  print(article['title']+ ' | ' + article['publishedAt'] + ' | ' + article['url'])
This should give you the titles, publication dates, and links for 5 news articles about Apple stock, published in the last 7 days. There is also an article description which we don't print for now, but it will be used for the sentiment analysis.

So the output might look as follows:
Py3. Output example from the News API
Daily Crunch: Apple commits to carbon neutrality | 2020-07-21T22:10:52Z | http://techcrunch.com/2020/07/21/daily-crunch-apple-commits-to-carbon-neutrality/

Daily Crunch: Slack files antitrust complaint against Microsoft | 2020-07-22T22:16:02Z | http://techcrunch.com/2020/07/22/daily-crunch-slack-microsoft-antitrust/

Jamf ups its IPO range, now targets a valuation of up to $2.7B | 2020-07-20T17:04:25Z | http://techcrunch.com/2020/07/20/jamf-ups-its-ipo-range-now-targets-a-valuation-of-up-to-2-7b/

S&P 500 turns positive for 2020, but most stocks are missing the party - Reuters | 2020-07-21T19:45:00Z | https://www.reuters.com/article/us-usa-stocks-performance-idUSKCN24M2RD

Avoid Apple stock as uncertainties from coronavirus weigh on iPhone launch, Goldman Sachs says | 2020-07-23T13:50:13Z | https://www.businessinsider.com/apple-stock-price-rally-risk-coronavirus-iphone-delay-earnings-goldman-2020-7
As you can see, only two of the articles are directly related to the search phrase; the other three mention Apple in the text and remain somewhat relevant.

You can also search in the title only by passing the qInTitle parameter instead of q (see the News API documentation). The caveat is that, at the time of writing, it was not implemented in the Python wrapper library, so you need to make an HTTP request to the API directly instead of using the simpler method call.
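A minimal sketch of such a direct request is shown here. It only builds the request URL, so it runs without a key or network access. The endpoint and the parameter names (qInTitle, from, sortBy, pageSize, apiKey) follow the News API HTTP interface as described in this article; the helper name build_title_search_url and its defaults are hypothetical, and it is an assumption that the API still accepts qInTitle when you run it.

```python
from urllib.parse import urlencode

# Endpoint of the News API "everything" search.
NEWS_API_EVERYTHING_URL = "https://newsapi.org/v2/everything"

def build_title_search_url(phrase, from_date, api_key, page_size=5):
    """Build a request URL that searches for the phrase in titles only.

    Hypothetical helper: the name and defaults are illustrative.
    """
    params = {
        "qInTitle": phrase,      # search in the title instead of the full text
        "from": from_date,       # ISO date, e.g. '2020-07-16'
        "language": "en",
        "sortBy": "relevancy",
        "pageSize": page_size,
        "apiKey": api_key,
    }
    return NEWS_API_EVERYTHING_URL + "?" + urlencode(params)

url = build_title_search_url("Apple stock", "2020-07-16", "your_news_api_key_here")
print(url)

# Sending the request needs network access and a valid key:
# import json, urllib.request
# with urllib.request.urlopen(url) as response:
#     articles = json.load(response)["articles"]
```

The response has the same structure as the one returned by the wrapper library, so the loop from the previous listing can be reused on it.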

Still, the question remains open: which articles should be selected for the analysis, and from which sources?

VADER Sentiment Analysis

The crucial step in this article is to understand what sentiment analysis actually is and why it works here. The best source is the original paper "VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text".

I will summarise its main principles here:
Polarity of a text is aggregated from the polarity of individual words (e.g. positive words such as love, nice, good, great: 406 words; negative words such as hurt, ugly, sad, bad, worse: 499 words).
Sentiment intensity is taken into account (e.g. degree modifiers: "extremely good" is stronger than "good").
Human-curated gold-standard resources: 20 people were hired to evaluate the predictions on the different types of text (tweets, reviews, tech, and news). Opinion news articles: included 5190 sentence-level snippets from 500 New York Times opinion editorials. VADER showed the highest correlation with human scores among all tested approaches on all types of text.
Context awareness: e.g. the word catch has negative sentiment in "At first glance the contract looks good, but there's a catch", but is neutral in "The fisherman plans to sell his catch at the market".
Punctuation: e.g. the exclamation point (!) and CAPITALISATION increase the magnitude of the intensity without modifying the semantic orientation. For example, "The food here is good!!!" is more intense than "The food here is good.".
Machine learning classifiers (e.g. Naive Bayes) appear in the paper as benchmarks; VADER itself is rule-based and does not need training data.
After applying these rules to the text, a single sentiment prediction is calculated, which is a value between -1 (strongly negative) and +1 (strongly positive).
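To make these principles more tangible, here is a toy sketch of a rule-based scorer. It is not the VADER implementation: the mini-lexicon, the booster words, and the boost sizes are hypothetical values chosen for illustration. The final step squashes the summed valence x into the range (-1, 1) with x / sqrt(x^2 + alpha), which, as far as we know, is the form of normalisation used for VADER's compound score (with alpha = 15).

```python
import math

# Hypothetical mini-lexicon: words and valences are illustrative only.
TOY_LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5, "ugly": -2.3}
# Hypothetical degree modifiers and their boost.
TOY_BOOSTERS = {"extremely": 0.3, "very": 0.3}
# Hypothetical emphasis added per exclamation mark (capped at three marks).
EXCLAMATION_BOOST = 0.3

def toy_compound(text, alpha=15):
    """Sum word valences with simple boosts, then squash into (-1, 1)."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    total = 0.0
    for i, word in enumerate(words):
        valence = TOY_LEXICON.get(word, 0.0)
        if valence and i > 0 and words[i - 1] in TOY_BOOSTERS:
            # A degree modifier pushes the valence further from zero.
            valence += math.copysign(TOY_BOOSTERS[words[i - 1]], valence)
        total += valence
    if total:
        # Exclamation marks increase the magnitude but keep the sign.
        emphasis = min(text.count("!"), 3) * EXCLAMATION_BOOST
        total += math.copysign(emphasis, total)
    # Normalisation of the form x / sqrt(x^2 + alpha).
    return total / math.sqrt(total * total + alpha)

for sentence in ["The food here is good.",
                 "The food here is good!!!",
                 "The food here is extremely good.",
                 "The service was bad."]:
    print(f"{toy_compound(sentence):+.3f}  {sentence}")
```

The real library uses a much larger human-rated lexicon and more rules (negation, capitalisation, contrastive conjunctions), so the numbers from this sketch will differ from the real VADER scores.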

Performing Sentiment Analysis for News

Let's create a new notebook for this project (a single script actually). To improve readability, we'll place the code within several code cells. In the first one, install the newsapi-python library in the notebook, just as you did for the test discussed in the previous section:
Py4. Installing the library
!pip install newsapi-python
You will also need to install the yfinance library to access the Yahoo Finance API covered in part 2:
Py5. Installing yfinance
!pip install yfinance
Next comes the import section, which includes all the required libraries:
Py6. Doing the imports
import sys
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from newsapi import NewsApiClient
from datetime import date, timedelta, datetime
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import yfinance as yf

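Let's download the lexicon needed to run the VADER sentiment analysis, and then create the analyzer. The analyzer loads the lexicon when it is created, so the download has to come first:
Py7. Downloading the VADER lexicon and creating the analyzer
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()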
In another cell, make sure to set the following option for pandas to see the full output in Colab:
Py8. Setting the correct options to see the full output
pd.set_option('display.max_colwidth', 1000)
Our goal is to download the first 100 English-language articles found for a search keyword on a specific date, sorted by relevancy.

Before that, we need a helper function that calls the sources endpoint, so that we can restrict the search to suitable sources. Put it into another cell:
Py9. Getting the list of sources
def get_sources(category = None):
  newsapi = NewsApiClient(api_key='your_api_key_here')
  sources = newsapi.get_sources()
  if category is not None:
    rez = [source['id'] for source in sources['sources'] if source['category'] == category and source['language'] == 'en']
  else:
    rez = [source['id'] for source in sources['sources'] if source['language'] == 'en']
  
  return rez
Let's now check out how many English (en) sources are available:
Py10. Total count of sources
len(get_sources())

# Output
# 81
And here are the business sources:
Py11. Get the list of the business news sources
#Get the list of the business news sources

get_sources('business')

# Output
# ['australian-financial-review',
# 'bloomberg',
# 'business-insider',
# 'business-insider-uk',
# 'financial-post',
# 'fortune',
# 'the-wall-street-journal']
Then, you define the function, which provides the implementation of the algorithm, calculating the sentiment figures.
Py12. Get the articles by the keyword and their sentiments
def get_articles_sentiments(keywrd, startd, sources_list = None, show_all_articles = False):
  #show_all_articles is kept for compatibility with the calls below; it is not used
  newsapi = NewsApiClient(api_key='your_api_key_here')
  #Accept the start date either as a 'dd-Mon-YYYY' string or as a date object
  if isinstance(startd, str):
    my_date = datetime.strptime(startd,'%d-%b-%Y')
  else:
    my_date = startd
  #Request one day of articles, sorted by relevancy (100 at most)
  params = dict(q = keywrd,
                from_param = my_date.isoformat(),
                to = (my_date + timedelta(days = 1)).isoformat(),
                language = "en",
                sort_by = "relevancy",
                page_size = 100)
  #If the sources list is provided - use it
  if sources_list:
    params['sources'] = ",".join(sources_list)
  articles = newsapi.get_everything(**params)
  date_sentiments_list = []
  seen = set()
  for article in articles['articles']:
    #Skip articles whose title has already been processed
    title = str(article['title'])
    if title in seen:
      continue
    seen.add(title)
    article_content = title + '. ' + str(article['description'])
    #Get the compound sentiment score (from -1 to +1)
    sentiment = sia.polarity_scores(article_content)['compound']
    date_sentiments_list.append((sentiment, article['url'], article['title'], article['description']))
  #Return a dataframe with all sentiment scores and articles (after the loop has finished)
  return pd.DataFrame(date_sentiments_list, columns=['Sentiment','URL','Title','Description'])
You can now perform some tests using the above function. First, we'll look at how it works for all news found for the keyword 'stock', for a certain date, and for ALL 'en' sources:
Py13. Get the aggregated statistics about the articles
return_articles = get_articles_sentiments(keywrd= 'stock', startd = '21-Jul-2020', sources_list = None, show_all_articles= True)
return_articles.Sentiment.hist(bins=30,grid=False)

print(return_articles.Sentiment.mean())
print(return_articles.Sentiment.count())
print(return_articles.Description)
As a result, you will see the distribution over 100 articles: many of them have a neutral sentiment, and the rest are skewed towards very positive scores.

This is what a fragment from the list of found articles might look like (top two negative articles):
Py14. Sorted output example (head) for the news sentiment API
return_articles.sort_values(by='Sentiment', ascending=True)[['Sentiment','URL']].head(2)

# Output:
# Sentiment    URL
# 58    -0.9062    https://www.reuters.com/article/india-nepal-palmoil-idUSL3N2ES1Y3
# 59    -0.8360    https://in.reuters.com/article/volvocars-results-idINKCN24M1D7
If you visit the first link (https://www.reuters.com/article/india-nepal-palmoil-idUSL3N2ES1Y3), you'll find: 'Nepal stops buying (New Dehli Suspended 39 oil import…)', which says it all.

You might want to look at the tail of the same list (still sorted in ascending order) to see the articles with the highest sentiment scores:
Py15. Sorted output example (tail) for the news sentiment API
return_articles.sort_values(by='Sentiment', ascending=True)[['Sentiment','URL']].tail(2)


# Output:
# Sentiment URL
# 37 0.9382 https://www.reuters.com/article/japan-stocks-midday-idUSL3N2ES06S
# 40 0.9559 https://www.marketwatch.com/story/best-buy-says-sales-are-better-during-pandemic-stock-heads-toward-all-time-high-2020-07-21
From the article above: "TOKYO, July 21 (Reuters) — Japanese stocks rose on Tuesday as signs of progress in developing a COVID-19 vaccine boosted investor confidence in the outlook for future economic growth."

Let's now look for articles about stocks for the same date, but in business sources only:
Py17. Aggregate statistics for the sentiment
sources = get_sources('business')
return_articles = get_articles_sentiments('stock','21-Jul-2020',sources_list = sources, show_all_articles = True)
return_articles.Sentiment.hist(bins = 30, grid = False)

print(return_articles.Sentiment.mean())

print(return_articles.Sentiment.count())

print(return_articles.Description)
This is what the output might look like, starting with the overall sentiment rank:
Py18. Mean-count-description for the articles
#Mean sentiment on 67 business articles
0.13

#Articles from the business sources
67

#Articles description examples
0 <ul>\n<li>Tesla CEO Elon Musk appears to have unlocked the second of his compensation goals on Tuesday. </li>\n<li>Despite a slight dip Tuesday, the company’s average market cap has been above $150 billion for long enough to unlock the second tranche of stock a…

1 <ul>\n<li>There’s a lot riding on Tesla’s second-quarter earnings report Wednesday afternoon.</li>\n<li>Analysts expect the company to post a $75 million loss for the three-month period ended June 31.</li>\n<li>Despite factory shutdowns and falling deliveries, t…

2 <ul>\n<li>Tesla reports its highly anticipated second-quarter earnings on Wednesday after market close. </li>\n<li>The report comes after the automaker’s second-quarter vehicle delivery numbers beat Wall Street expectations. </li>\n<li>Investors and analysts wil…
...
Figure-1: 21-Jul-2020, sentiment on 67 articles about stocks (business sources)
You can compare the results with the previous day, taking all news about stocks:
Py19. All articles sentiment for the word 'stock' for 20-Jul-2020
return_articles = get_articles_sentiments('stock','20-Jul-2020',show_all_articles=True)
return_articles.Sentiment.hist(bins = 30, grid = False)
return_articles.Sentiment.mean()
Since we analyse all sources, you should find similar results from the first 100 articles on stocks:
Py20. Mean sentiment for 100 articles
#Mean sentiment on 100 articles
0.22501616161616164
Figure-2: 20-Jul-2020, Sentiment distribution for 1 day of stock news for all sources (top 100 articles sentiment)
There are more articles here (100 vs. 67 from business sources), so the mean sentiment should contain more signals from various sources. The problem is that it can now include news from smaller newspapers that don't have a wide audience among people who trade.

You may try to find the correlation of a stock price or an index price with other metrics, such as the top negative score, the top positive score, the difference between the top negative and top positive scores, or the median sentiment over all articles. We will continue using the mean() estimate for the rest of the code.
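If you want to experiment with these alternative metrics, the following sketch computes them for one day of scores. It is plain Python so that it runs without the API; the function name summarise_sentiments and the sample scores are hypothetical, not output from the News API.

```python
from statistics import mean, median

def summarise_sentiments(scores):
    """Return several one-number summaries of a day's compound scores."""
    top_positive = max(scores)
    top_negative = min(scores)
    return {
        "mean": mean(scores),
        "median": median(scores),
        "top_positive": top_positive,
        "top_negative": top_negative,
        # Gap between the most positive and the most negative article
        "spread": top_positive - top_negative,
    }

# Hypothetical compound scores for one day (illustrative values only).
sample_scores = [0.0, 0.0, 0.44, -0.3, 0.9, 0.0, -0.8, 0.6]
for name, value in summarise_sentiments(sample_scores).items():
    print(f"{name}: {value:.3f}")
```

With the dataframe returned by get_articles_sentiments, you could pass return_articles.Sentiment.tolist() as the scores argument.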

Now let's check the whole month: get top daily news and sentiments about the stock market from all sources and business newspapers:
Py20. Daily sentiments for the articles found by the keyword 'stock'
#The FREE News API plan allows retrieving only 1 month of news data

end_date = date.today()
#Subtracting 30 days also works in January and at month ends (unlike month - 1)
start_date = end_date - timedelta(days=30)

print('Start day = ', start_date)
print('End day = ', end_date)

current_day = start_date
business_sources = get_sources('business')
sentiment_all_score = []
sentiment_business_score = []

dates=[]

while current_day <= end_date:
  dates.append(current_day)
  sentiments_all = get_articles_sentiments(keywrd = 'stock', startd = current_day, sources_list = None, show_all_articles = True)
  #Take the mean of the Sentiment column only (the other columns contain text)
  sentiment_all_score.append(sentiments_all.Sentiment.mean())
  sentiments_business = get_articles_sentiments(keywrd = 'stock', startd = current_day, sources_list = business_sources, show_all_articles = True)
  sentiment_business_score.append(sentiments_business.Sentiment.mean())

  current_day = current_day + timedelta(days=1)
You might want to compare the overall sentiment figures for the articles retrieved with and without 'business' category filter. For that, we'll create a pandas dataframe as follows:
Py21. Building the dataframe with daily sentiments
sentiments = pd.DataFrame([dates,np.array(sentiment_all_score),np.array(sentiment_business_score)]).transpose()

sentiments.columns =['Date','All_sources_sentiment','Business_sources_sentiment']

sentiments['Date'] = pd.to_datetime(sentiments['Date'])

sentiments['All_sources_sentiment'] = sentiments['All_sources_sentiment'].astype(float)
sentiments['Business_sources_sentiment'] = sentiments['Business_sources_sentiment'].astype(float)
Before going any further, let's look at the structure of the dataframe we finally got:
Py22. Sentiment dataframe info
sentiments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 3 columns):
#   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
0   Date                     31 non-null  datetime64[ns]
1   All_sources_sentiment    31 non-null  float64
2   Business_sources_sentiment  31 non-null  float64
dtypes: datetime64[ns](1), float64(2)
memory usage: 872.0 bytes
Now let's make the Date column the index so that we can join it with other data sources:
Py23. Setting Date column as an index
sentiments.set_index("Date", inplace=True)

sentiments.head()

Date   All_sources_sentiment    Business_sources_sentiment
2020-06-21    0.209889    0.111956
2020-06-22    0.219228    0.155876
2020-06-23    0.115508    0.102921
2020-06-24    0.084642    0.017751
2020-06-25    0.155524    0.005206
OK, now that we have daily sentiment figures for 1 month, why don't we compare them with real market figures for this same period, say, with S&P500 index?

Checking Daily Stock News Sentiment vs. Growth of S&P500 Index

As you might recall from the discussion on S&P500 index in part 2, it can be obtained as follows:
Py 24. Importing Pandas Datareader
import pandas_datareader.data as pdr

end = date.today()
start = end - timedelta(days=30)

print(f'Period 1 month until today: {start} to {end}')

Period 1 month until today: 2020-06-21 to 2020-07-21
Now we can obtain the index daily close prices:
Py 25. Get S&P500 daily prices
spx_index = pdr.get_data_stooq('^SPX', start, end)
spx_index.index

DatetimeIndex(['2020-07-21', '2020-07-20', '2020-07-17', '2020-07-16', '2020-07-15', '2020-07-14', '2020-07-13', '2020-07-10', '2020-07-09', '2020-07-08', '2020-07-07', '2020-07-06', '2020-07-02', '2020-07-01', '2020-06-30', '2020-06-29', '2020-06-26', '2020-06-25', '2020-06-24', '2020-06-23', '2020-06-22'], dtype='datetime64[ns]', name='Date', freq=None)
In the next step, you might want to make a plot with the S&P500 data:

spx_index['Close'].plot(title='1 month price history for index S&P500 Index')
Figure-3: S&P 500 index history, 21-June-2020 to 21-July-2020
Now let's join our sentiment data with S&P500 index data:
Py 26. Sentiment index vs. s&p500 close price
sentiments_vs_snp = sentiments.join(spx_index['Close']).dropna()
sentiments_vs_snp.rename(columns={'Close':'s&p500_close'}, inplace=True)
sentiments_vs_snp.head()

Date    All_sources_sentiment  Business_sources_sentiment    s&p500_close
2020-06-22    0.219228    0.155876    3117.86
2020-06-23    0.115508    0.102921    3131.29
2020-06-24    0.084642    0.017751    3050.33
2020-06-25    0.155524    0.005206    3083.76
2020-06-26    0.124339    0.008645    3009.05
How would the sentiment from all news sources and the S&P500 data look on the same plot (left axis = S&P500 index, right axis = average news sentiment)?
Py 27. Sentiment vs. S&P500 graph lines
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(rc={'figure.figsize':(13.0,8.0)})
ax=sns.lineplot(data=sentiments_vs_snp['s&p500_close'], color="b",label='S&P500 Close price')
ax2 = plt.twinx()
sns.lineplot(data=sentiments_vs_snp["All_sources_sentiment"], color="g", ax=ax2, label='All sources sentiment')
Figure-4: S&P 500 vs. All sources sentiment (articles about 'stocks')
You might also want to compare the sentiment figures obtained from the articles in the business category with the S&P500 data:
Py 28. S&P500 vs. Business articles (about 'stocks') sentiment
sns.set(rc={'figure.figsize':(13.0,8.0)})
ax=sns.lineplot(data=sentiments_vs_snp['s&p500_close'], color="b", label='S&P500 Close price')
ax2 = plt.twinx()
sns.lineplot(data=sentiments_vs_snp["Business_sources_sentiment"], color="g", ax=ax2, label='Business_sources_sentiment')
Figure-5: S&P 500 vs. Business sources sentiment (articles about 'stocks')
As you can see, business sentiment figures look closer to the S&P500 data: they tend to move in the same direction.
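One way to put a number on this visual impression is the Pearson correlation coefficient. The sketch below is a plain-Python illustration: the pearson helper is our own and not part of any library used in this article, and the input is just the five rows printed by sentiments_vs_snp.head() earlier. Five points are far too few to draw conclusions, so treat the result as a demonstration of the calculation only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# The five rows shown by sentiments_vs_snp.head() (22 to 26 June 2020).
all_sources = [0.219228, 0.115508, 0.084642, 0.155524, 0.124339]
business = [0.155876, 0.102921, 0.017751, 0.005206, 0.008645]
snp_close = [3117.86, 3131.29, 3050.33, 3083.76, 3009.05]

print("All sources vs. S&P500:", round(pearson(all_sources, snp_close), 3))
print("Business sources vs. S&P500:", round(pearson(business, snp_close), 3))
```

On the full dataframe, sentiments_vs_snp.corr() returns the same kind of coefficients for every pair of columns over the whole month.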
Conclusion
In this third part of the series, you retrieved financial news with the News API, scored the articles with the VADER sentiment analysis library, and compared the daily mean sentiment with the S&P500 index. On the plots, the sentiment of business sources moved more closely with the index than the sentiment of all sources. Keep in mind that this is a visual comparison over a single month, so it is a starting point for further analysis rather than a trading rule.
