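Continued from the previous post, "Machine Learning in Publishing Application (1)". This is an essay I wrote as coursework during my computer science studies; it surveys some possible applications of machine learning in the publishing industry. Digital publishing is booming and the digitization of books has reached an initial scale, yet modern technology is still rarely applied to publishing itself. This essay hopes to bring a new technology such as machine learning into the old industry of publishing, and hopes the two can strike some new sparks. The essay was written in English and is fairly long, so it is reproduced here in full in three parts. This is part (2).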
Machine Learning in Publishing Application (cont.)
3 Helping the Publishers
As an editor in a publishing company, you not only do creative work but also handle a great deal of detailed, repetitive work. Proofreading, printing, and distribution, for example, still rely heavily on human input. With the help of powerful computers, these jobs can be assisted or even replaced by machines. Handling the huge amount of data about consumers and about products on the market is a further, new challenge.
3.1 Proofreading
Proofreading is a very important part of publishing. It requires careful checking and correction of the manuscript, and it is a profession with a high entry barrier that demands specialized training: proofreaders need not only a good education in literature and strong writing skills but also a good understanding of the subject of the manuscript. A good proofreader is therefore not suitable for every proofreading job. For a manuscript on mathematics, for example, a proofreader with a background in mathematics may be more suitable than one with a background in literature or history, even if the latter has better writing skills. Computers, however, do not have this limitation.
In our daily editing work, we often use intelligent proofreading programs or functions, such as Word’s text correction, proofreading extensions for Chrome, and Grammarly. In the author’s previous work proofreading books and articles written in Chinese, we used professional proofreading software such as Black Horse Proofreading (Figure 3) or Founder’s. Black Horse Proofreading applies Chinese word segmentation, a Natural Language Processing (NLP) technique, together with a massive thesaurus. Although these programs are continually introducing artificial intelligence, they are not perfect and still need a lot of human effort for the final proofreading pass. For NLP problems, deep learning is currently a popular research direction; Stanford University, for example, offers a course named Deep Learning for Natural Language Processing. Technologies such as deep learning and data mining can therefore be developed further and applied to proofreading software to make proofreading more effective and easier. Ideally, with analysis of natural language grammar combined with corpus statistics and cross-referencing, proofreading in publishing could be left entirely to artificial intelligence.
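To make the idea of dictionary-based word segmentation concrete, here is a minimal sketch of forward maximum matching, a classical segmentation method. It is not a description of how Black Horse Proofreading or any other product works; the tiny lexicon, the function names `segment` and `flag_suspects`, and the rule of flagging every unknown token are all assumptions made for illustration.

```python
# Toy lexicon; a real proofreading tool would load a large dictionary (assumption).
LEXICON = {"机器", "学习", "机器学习", "校对", "软件", "出版"}


def segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position, take the longest lexicon word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def flag_suspects(tokens, lexicon):
    """Return tokens the lexicon does not know, to be shown to a human proofreader."""
    return [tok for tok in tokens if tok not in lexicon]


clean = segment("机器学习校对软件", LEXICON)
typo = segment("机器学习校队软件", LEXICON)  # "校队" is a deliberate typo for "校对"
print(clean)
print(flag_suspects(typo, LEXICON))
```

In this toy setting the misspelled word falls apart into single characters that the lexicon does not contain, which is what gets flagged. A practical system would also need to accept valid single-character words and rare proper nouns, which is one reason human proofreading is still required.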
3.2 Precise Marketing
In book publishing, apart from revising and polishing the manuscript, most of the work is market-related: readership data analysis, book recommendations, market prospects, printing quantity determination, and so on.
3.2.1 Readership data analysis
For traditional publishers, the interests of readers have always been very hard to pin down. Because tracking readers’ preferences across a large market and over time was long technically impossible, publishing companies have relied on market research to select the books that they believe will resonate with readers. For many years, this market research has been based on book club profiles, interviews, or surveys. But these traditional methods cannot capture readers’ preferences sufficiently or scientifically.
However, with the rise of e-readers and online reading, behavioral data related to reading presents a great opportunity for publishing companies. People not only read online but also review and discuss books online, and authors can communicate directly with their readers. All this behavioral data holds great power, and machine learning can be applied to analyze it. For publishing companies, data from digital platforms can be used to track reader behavior, which in turn helps them make decisions that are more responsive to market needs.
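As a small illustration of what tracking reader behavior can mean in practice, the sketch below computes, for each book, the share of readers who finished it. The event format `(reader_id, book_id, fraction_read)`, the function name `completion_rates`, and the 0.9 threshold for "finished" are assumptions for this example, not the data model of any real reading platform.

```python
from collections import defaultdict


def completion_rates(events, finished_at=0.9):
    """Share of readers per book whose furthest reading position reached `finished_at`.

    `events` is an iterable of (reader_id, book_id, fraction_read) tuples,
    where fraction_read is between 0.0 and 1.0 (hypothetical log format).
    """
    # Keep only the furthest position each reader reached in each book.
    furthest = {}
    for reader, book, fraction in events:
        key = (reader, book)
        furthest[key] = max(furthest.get(key, 0.0), fraction)

    # Group those positions by book.
    per_book = defaultdict(list)
    for (_reader, book), fraction in furthest.items():
        per_book[book].append(fraction)

    return {
        book: sum(1 for f in fractions if f >= finished_at) / len(fractions)
        for book, fractions in per_book.items()
    }


# Invented sample events, for illustration only.
sample = [
    ("r1", "A", 0.2),
    ("r1", "A", 1.0),
    ("r2", "A", 0.3),
    ("r1", "B", 0.95),
]
print(completion_rates(sample))
```

A summary like this is only a starting point; it could serve as one input feature for the machine learning analysis discussed above.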
3.2.2 Book recommendations
Using machine learning to analyze book content in order to better serve reading and publishing is not a new idea. Since 2010, companies such as BookLamp, Trajectory, and Intellogo have all tried to use big data to analyze books’ content and readers’ behavior, extracting the writing style, rhythm, and emotion of existing books through machine learning so as to recommend books to readers more accurately. In Amazon’s online store, automatic recommendations based on artificial intelligence are already widely used. Amazon’s recommendation process is currently based on content matching with keywords and metadata, as well as personalized recommendations based on algorithmic comparison with the behavior of other customers who have “also purchased similar books”. We can also find projects on the internet such as “Building a Content-Based Book Recommendation Engine” or “Book Recommender with Python”.
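The "also purchased" idea can be sketched with simple co-occurrence counting over purchase baskets. This is a deliberately simplified illustration, not Amazon's actual algorithm; the function names `build_co_purchases` and `also_purchased` and the sample baskets are invented for the example.

```python
from collections import Counter
from itertools import combinations


def build_co_purchases(baskets):
    """Count how often each pair of titles appears in the same customer's basket."""
    co = {}
    for basket in baskets:
        # Deduplicate within a basket so a repeated title is counted once.
        for a, b in combinations(sorted(set(basket)), 2):
            co.setdefault(a, Counter())[b] += 1
            co.setdefault(b, Counter())[a] += 1
    return co


def also_purchased(co, book, k=3):
    """Return up to k titles most often bought together with `book`."""
    # Sort by descending count, then alphabetically to break ties predictably.
    ranked = sorted(co.get(book, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _count in ranked[:k]]


# Invented purchase histories, for illustration only.
baskets = [
    ["Dune", "Foundation", "Hyperion"],
    ["Dune", "Foundation"],
    ["Dune", "Twilight"],
]
co = build_co_purchases(baskets)
print(also_purchased(co, "Dune", k=2))
```

Raw counts favor titles that are popular overall, so a production recommender would normally normalize them in some way; that refinement is left out here to keep the sketch short.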
For publishers, although the marketing of books can be delegated to third-party companies or booksellers like Amazon, it is still important to recommend their books accurately to readers. For example, Intellogo’s bot first combs through the entire library of books and generates a fine-grained content analysis report covering theme, style, point of view, tone, etc. Then, based on these content details, it generates enhanced, standardized metadata that can be used by companies, retailers, and partners. Finally, research on published content can be combined with research on consumer behavior data, revealing new business opportunities through a deeper understanding of reader preferences (Neil, 2016).
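Once each book has content metadata of this kind, a content-based match can be sketched as cosine similarity between tag vectors. The catalog, the tag vocabulary, and the function names `cosine` and `most_similar` below are hypothetical; this is not Intellogo's method, only an illustration of matching on content attributes.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two tag-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(catalog, title):
    """Return the other title whose tags are closest to those of `title`."""
    target = Counter(catalog[title])
    scores = {
        other: cosine(target, Counter(tags))
        for other, tags in catalog.items()
        if other != title
    }
    return max(scores, key=scores.get)


# Invented metadata (theme and tone tags), for illustration only.
catalog = {
    "Book A": ["space", "politics", "epic"],
    "Book B": ["space", "epic", "war"],
    "Book C": ["romance", "vampire"],
}
print(most_similar(catalog, "Book A"))
```

The quality of such a match depends entirely on the quality of the metadata, which is why the standardized content analysis described above matters.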
Of course, when we research reader behavior, we can use reader profiling, which leads into the area of user profiling. There has been a lot of research on using machine learning for user profiling, with results in both shallow learning and deep learning, which will not be repeated here.
3.2.3 Market prospects
Publishing is based on the knowledge, experience, and intuition of editors and literary institutions, but knowledge, experience, and intuition are not always reliable. The tortuous publication histories of many mega-bestsellers arguably confirm this point. Harry Potter and the Sorcerer’s Stone was rejected by 12 different publishing houses before Bloomsbury accepted it. Twilight was rejected 14 times before being accepted; Gone with the Wind was rejected 38 times; Dune was rejected 23 times, etc.
How many great novels never had a chance to be published because editors or experts misjudged them? And how many authors have given up after facing a few rejections? Predicting the future sales of a novel is therefore a very big challenge. In 2016, in the book The Bestseller Code, Jodie Archer and Matthew L. Jockers presented an algorithm for detecting the sales potential of books. In May of the same year, the publishing platform Inkitt teamed up with Tor to release the first book picked by an algorithm. Although these two events caused some stir in the publishing industry, no further development has followed.
Just because there are no successful examples does not mean it is not feasible. Suppose we analyze and compare existing bestsellers, use machine learning to characterize the tone, emotion, topic, and writing style of a bestseller, and then use that to better understand reader needs. Would the algorithm then be able to predict whether a book could become a bestseller?
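To show the shape such a prediction could take, here is a minimal sketch of a nearest-centroid classifier over numeric style features. It is not the algorithm from The Bestseller Code or the one used by Inkitt; the two features (imagined scores for emotional intensity and pacing), the labels, and every number in the training data are invented for illustration.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit(samples):
    """Compute one centroid per label from (features, label) pairs."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(rows) for label, rows in by_label.items()}


def predict(model, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist2(center):
        return sum((x - y) ** 2 for x, y in zip(features, center))
    return min(model, key=lambda label: dist2(model[label]))


# Invented training data: [emotional_intensity, pacing] scores in the range 0..1.
training = [
    ([0.9, 0.8], "bestseller"),
    ([0.8, 0.9], "bestseller"),
    ([0.2, 0.3], "other"),
    ([0.3, 0.1], "other"),
]
model = fit(training)
print(predict(model, [0.7, 0.75]))
```

Whether real bestsellers separate this cleanly in any feature space is exactly the open question raised above; the sketch only shows how a trained model would be queried for a new manuscript.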
3.2.4 Printing quantity determination
The total number of books to print is influenced by many factors, not only by the market. These factors may be political, social, cultural, or personal, so the market analysis in the previous section cannot be applied directly to the choice of print run. Based on an analysis of previous printing and sales data, as well as market forecasts, we can assign different weights to these influencing factors and use machine learning to forecast a print quantity. Of course, this forecast is only a reference for actual production; the final number still has to be decided by a human.
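The simplest version of learning a weight from past data is a one-variable least-squares fit. The sketch below reduces the problem to a single factor, which is a strong simplification of the many factors mentioned above; the choice of pre-order count as that factor, the function names, and all figures are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one input variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast_print_run(slope, intercept, factor_value):
    """Suggested print quantity for a new title; a reference value only."""
    return slope * factor_value + intercept


# Invented history: pre-orders and final sales, both in thousands of copies.
pre_orders = [1, 2, 3, 4]
final_sales = [3, 5, 7, 9]

slope, intercept = fit_line(pre_orders, final_sales)
suggestion = forecast_print_run(slope, intercept, 5)
print(f"Suggested print run: {suggestion:.1f} thousand copies")
```

Handling several weighted factors at once would call for multivariate regression or another learned model, and the result remains a suggestion for the editor rather than a decision.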