Building a Database of News Content
News has always been a significant part of our society. In the past, we depended mostly on news channels and newspapers to stay up to date. In today’s fast-paced world, news media and agencies have started using the internet to reach readers. The move has proven very helpful, as it has allowed media houses to extend their reach.
There are now so many media outlets that, given a busy schedule, no one can realistically gather news from all of them. Besides, each outlet covers a story differently, and some readers like to read the same story from multiple media houses to get a fuller picture of an event. Both needs are addressed by a type of application that is currently gaining popularity: online news distribution applications. These applications gather news from multiple sources and present it to the user as a feed. In this article, we will look at one approach to building such an application.
The Idea
The main component of such an application is, of course, the news. I have used four of the most popular media houses in India as sources for the application. Each media house has its own website, from which we scrape the headline links and the stories. We will use extractive text summarization to condense each story into its 3 to 5 most important sentences. We will store the collected information, along with the source (the name of the publishing media house), date, time, and title of the story, in datewise files. Each datewise file gives the feed for that particular date.
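The article does not say which extractive method is used, so the following is only a minimal sketch of one simple option: score each sentence by the average frequency of its non-stopword terms and keep the top few in their original order. The function name `summarize`, the tiny stopword list, and the sentence-splitting regex are all assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "was",
             "on", "for", "it", "that", "this", "with", "as", "by", "at"}

def _terms(text):
    """Lowercased word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def summarize(text, max_sentences=3):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(_terms(text))

    def score(sentence):
        tokens = _terms(sentence)
        # Average term frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore reading order
    return " ".join(sentences[i] for i in keep)
```

Calling `summarize(story_text, max_sentences=5)` would give the upper end of the 3 to 5 sentence range mentioned above. Library-based summarizers are likely to produce better results than this frequency heuristic.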
We can extract one more piece of information from the story title: the subject of the story. Each title contains some relevant information, such as the name of a person, a country, an organization, or an important topic of the time, for instance COVID-19. These names or topics are usually the subjects of the story. We will extract these words of interest from the title and use them as labels, or tags, for the corresponding story. We will store these labels alongside the titles in the files.
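How the words of interest are found is not specified here. As a rough sketch, assuming titles are written in sentence case, one heuristic is to group consecutive capitalized words into candidate tags. The function `extract_tags` and its stopword list are hypothetical; a named-entity recognizer would be a more robust choice in practice.

```python
import re

# Capitalized words that commonly start a title but are not subjects (illustrative list).
TITLE_STOPWORDS = {"A", "An", "The", "In", "On", "Of", "For", "To", "And",
                   "As", "At", "By", "With", "After", "Amid"}

def extract_tags(title):
    """Group runs of consecutive capitalized words into candidate tags."""
    tags, current = [], []
    for token in re.findall(r"[A-Za-z0-9][A-Za-z0-9'-]*", title):
        if token[0].isupper() and token not in TITLE_STOPWORDS:
            current.append(token)
        else:
            if current:
                tags.append(" ".join(current))
            current = []
    if current:  # flush a run that ends the title
        tags.append(" ".join(current))
    return tags

print(extract_tags("India reports record COVID-19 cases as Delhi tightens curbs"))
# ['India', 'COVID-19', 'Delhi']
```

This heuristic breaks down for titles written in title case, where every word is capitalized, which is one reason to treat it only as a starting point.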
An app can be used by many different types of users, so we must create a filtering or recommender mechanism to customize each user’s feed according to their interests. For this, we need a login system, so that we can record separately the kinds of stories each user reads and base recommendations only on that user’s own account. We will maintain a database containing each user’s name, email, phone number (optional), and password. The email will be our unique key here.
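The article does not name a database engine, so the sketch below assumes SQLite through Python’s built-in `sqlite3` module. The table layout and the helpers `open_user_db`, `register`, and `login` are hypothetical. One deliberate departure from a literal reading of the text: the password is stored as a salted PBKDF2 hash rather than as plain text.

```python
import hashlib
import hmac
import os
import sqlite3

def open_user_db(path=":memory:"):
    """Open the database and create the users table; email is the unique key."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS users (
               email TEXT PRIMARY KEY,
               name TEXT NOT NULL,
               phone TEXT,              -- optional
               salt BLOB NOT NULL,
               password_hash BLOB NOT NULL
           )"""
    )
    return conn

def _hash(password, salt):
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)

def register(conn, name, email, password, phone=None):
    """Insert a new user; return False if the email is already registered."""
    salt = os.urandom(16)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO users (email, name, phone, salt, password_hash) VALUES (?, ?, ?, ?, ?)",
                (email, name, phone, salt, _hash(password, salt)),
            )
    except sqlite3.IntegrityError:
        return False
    return True

def login(conn, email, password):
    """Return True only if the email exists and the password matches."""
    row = conn.execute(
        "SELECT salt, password_hash FROM users WHERE email = ?", (email,)
    ).fetchone()
    return row is not None and hmac.compare_digest(row[1], _hash(password, row[0]))
```

Making `email` the primary key lets the database itself reject duplicate registrations, which matches its role as the unique key.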
We will also maintain two JSON files. The first records the stories each user reads and the corresponding labels, using the user’s email as the key. The labels tell us which topics the user is interested in. The second file records the users who read each story. In this file, we form a unique key in the format:
Publishing House+$+Publishing Date+$+Story Title
This key is used as the key in our JSON file, and each key maps to the emails of the users who read that story. The idea is that the labels stored against each email in the user file allow us to make content-based recommendations, and if we use both files together, we can build a full user-item interaction matrix, which can be used for collaborative-filtering-based recommendations.
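The two structures and the key can be sketched as plain dictionaries that are later dumped to JSON. This sketch reads the key format as the three fields joined by a `$` separator; if the separator is meant to be the literal string `+$+`, only the `SEPARATOR` constant changes. The exact shape of the user file (email mapped to story keys and their labels), the helper names, and the sample publishers and emails are assumptions.

```python
import json

SEPARATOR = "$"  # assumption: the "+" in the format denotes concatenation

def story_key(publisher, date, title):
    """Build the unique story key from publisher, date and title."""
    return SEPARATOR.join([publisher, date, title])

def record_read(user_reads, story_readers, email, key, tags):
    """Update both structures when a user opens a story."""
    user_reads.setdefault(email, {})[key] = list(tags)   # file 1: email -> stories + labels
    readers = story_readers.setdefault(key, [])          # file 2: story key -> emails
    if email not in readers:
        readers.append(email)

def interaction_matrix(user_reads, story_readers):
    """Binary user-item matrix: rows are users, columns are stories."""
    users = sorted(user_reads)
    stories = sorted(story_readers)
    matrix = [[1 if u in story_readers[s] else 0 for s in stories] for u in users]
    return users, stories, matrix

user_reads, story_readers = {}, {}
k1 = story_key("Publisher A", "2020-08-01", "India reports record COVID-19 cases")
k2 = story_key("Publisher B", "2020-08-01", "Delhi tightens curbs")
record_read(user_reads, story_readers, "asha@example.com", k1, ["India", "COVID-19"])
record_read(user_reads, story_readers, "asha@example.com", k2, ["Delhi"])
record_read(user_reads, story_readers, "ravi@example.com", k1, ["India", "COVID-19"])

users, stories, matrix = interaction_matrix(user_reads, story_readers)
print(json.dumps(story_readers, indent=2))  # what the second JSON file would contain
print(matrix)  # [[1, 1], [1, 0]]
```

A collaborative filtering method would take `matrix` as its input, while the labels in `user_reads` feed the content-based side.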
We can now offer the user three types of news feed:
- Latest Feed: the fresh feed for each day.
- Most Popular Stories.
- Customized Feed: may contain unvisited stories from the last 2–3 days, tuned to the user’s interests.
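For the Customized Feed, one simple content-based approach is to count how often each label appears in the stories a user has read and then rank unread recent stories by how well their labels match those counts. The sketch below assumes that approach. The function `customized_feed` and its input shapes are hypothetical, and the caller is assumed to pass in only stories from the last 2–3 days.

```python
from collections import Counter

def customized_feed(user_history, candidates, top_n=5):
    """Rank unread candidate stories by overlap with the user's past labels.

    user_history: {story_key: [labels]} for stories the user has already read.
    candidates:   {story_key: [labels]} for recent stories.
    """
    interest = Counter(tag for tags in user_history.values() for tag in tags)
    scored = []
    for key, tags in candidates.items():
        if key in user_history:
            continue  # only unvisited stories belong in this feed
        score = sum(interest[t] for t in set(tags))
        if score > 0:
            scored.append((score, key))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by key
    return [key for _, key in scored[:top_n]]
```

Stories with no matching label are left out here, which is why the untuned Latest Feed is still needed to surface them.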
Note that the Latest Feed is neither personalized nor ranked by popularity. It is still essential: it makes sure every story can reach a user and adds a bit of randomness, without which the whole feed would be too biased. The Latest Feed contains only the current date’s stories. To obtain the popularity of a story, we use the JSON file that records, for each story, the emails of all users who visited it. The popularity of a story is simply the length of its list of emails.
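Since popularity is defined as the length of a story’s list of emails, the Most Popular feed reduces to a sort on that length. A minimal sketch, assuming the story file has been loaded into a dictionary mapping each story key to its list of emails (the name `most_popular` and the short sample keys are hypothetical):

```python
def most_popular(story_readers, top_n=5):
    """Return (story_key, popularity) pairs, most-read stories first."""
    ranked = sorted(story_readers.items(), key=lambda item: len(item[1]), reverse=True)
    return [(key, len(emails)) for key, emails in ranked[:top_n]]

story_readers = {
    "k1": ["a@example.com", "b@example.com"],
    "k2": ["a@example.com"],
    "k3": ["a@example.com", "b@example.com", "c@example.com"],
}
print(most_popular(story_readers, top_n=2))  # [('k3', 3), ('k1', 2)]
```

This count is only meaningful if each email appears at most once per story, so the code that records visits should avoid duplicates.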
The next thing we must do is add a search option. As readers, we often want to read about a particular topic, and this option lets our users do that.
Lastly, we need a “similar stories” option. When we buy a product on an e-commerce site, the site shows us similar products to make browsing easier. We will use a similar feature: if a user chooses to read a particular story, we will show them similar stories to improve their experience.
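How similarity is measured is not stated at this point. One lightweight possibility, shown here purely as an assumption, is to compare the label sets of two stories with Jaccard similarity (shared labels divided by all labels). The functions `jaccard` and `similar_stories` and the short sample keys are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two label collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_stories(target_key, story_tags, top_n=3):
    """Return keys of the stories whose labels best overlap the target's labels."""
    target = story_tags[target_key]
    scored = [
        (jaccard(target, tags), key)
        for key, tags in story_tags.items()
        if key != target_key
    ]
    scored = [(s, k) for s, k in scored if s > 0]   # drop unrelated stories
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [key for _, key in scored[:top_n]]

story_tags = {
    "s1": ["India", "COVID-19"],
    "s2": ["India", "COVID-19", "Delhi"],
    "s3": ["India"],
    "s4": ["Cricket"],
}
print(similar_stories("s1", story_tags))  # ['s2', 's3']
```

Comparing the summary text itself, for example with TF-IDF vectors, would be an alternative when titles yield too few labels.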
We have seen the whole idea. Now, let’s jump into the application part.
Application
Let’s first see how the news websites look and how we can easily scrape the required data.
