Application of machine learning to create a Myanmar news classification system

For the past few months, I have been working on a few hobby projects. A friend of mine said he was doing research that involved reading news articles related to his subject. So I told him I would create a news crawler that automatically lists articles relevant to his research topic.

I brushed up on machine learning and started building a news classifier using a very straightforward linear learning algorithm. I ended up indexing over 7,000 pages from news websites in Myanmar and created my search engine interface here: http://zilpatech.asia/spider

The search engine currently classifies news automatically and groups it into categories such as politics, business, education, health, technology and so on.

Machine learning is a very exciting field, and a lot of major companies use it today, whether in Facebook’s posts, LinkedIn recommendations, Google searches and so on. In the near future, I believe many smaller firms will benefit from it as well.

I have talked to a few people working at both local and international media and PR firms, and they have also expressed interest in an application that would help them follow the latest trends and do media monitoring. In fact, a few of them have already started using it.

I can see there are many more applications of this technology elsewhere. Although I am working on this project just for fun, I am keen to work on a practical application for a client if there is a need.

My next step is to create a linear classifier that assigns each keyword a coefficient for each category (politics, business, health, education, etc.), as in sentiment analysis, and hopefully get better results. Input -> black box -> output, where the output is a weighted sum of the inputs. You can never know in advance what the results will be, so I will then check the model against a validation set, see the classification error, and fine-tune it. This is just another form of supervised machine learning.
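
To make the idea concrete, here is a minimal sketch of that plan in Python. It is not the code behind the search engine. It assumes a simple perceptron-style update as the linear learner, a naive whitespace tokenizer, and a tiny made-up English data set. The category list, the function names and the example headlines are all hypothetical and only there for illustration.

```python
from collections import defaultdict

# Hypothetical subset of categories, for illustration only.
CATEGORIES = ["politics", "business", "health"]


def tokenize(text):
    # Naive whitespace tokenizer (assumption); real Myanmar text would
    # need proper word segmentation before this step.
    return text.lower().split()


def predict(weights, text):
    # The score of each category is a weighted sum of the keywords present.
    tokens = tokenize(text)
    scores = {c: sum(weights[c].get(t, 0.0) for t in tokens) for c in CATEGORIES}
    # Ties are broken by the order of CATEGORIES.
    return max(CATEGORIES, key=lambda c: scores[c])


def train(examples, epochs=10):
    # One coefficient per (category, keyword) pair, starting at zero.
    weights = {c: defaultdict(float) for c in CATEGORIES}
    for _ in range(epochs):
        for text, label in examples:
            guess = predict(weights, text)
            if guess != label:
                # Perceptron-style update: reward the true category,
                # penalize the wrongly predicted one.
                for t in tokenize(text):
                    weights[label][t] += 1.0
                    weights[guess][t] -= 1.0
    return weights


def error_rate(weights, examples):
    # Fraction of examples the model classifies incorrectly.
    wrong = sum(1 for text, label in examples if predict(weights, text) != label)
    return wrong / len(examples)


# Made-up labeled headlines (hypothetical data, not from the real index).
TRAIN = [
    ("parliament passes new election law", "politics"),
    ("minister meets opposition party leaders", "politics"),
    ("bank raises interest rates for exporters", "business"),
    ("stock market and trade figures rise", "business"),
    ("hospital opens new clinic for patients", "health"),
    ("doctors warn about dengue outbreak", "health"),
]
VALIDATION = [
    ("election law debated in parliament", "politics"),
    ("trade and interest rates", "business"),
    ("clinic treats dengue patients", "health"),
]

weights = train(TRAIN)
print("validation error:", error_rate(weights, VALIDATION))
print("health coefficients:", dict(weights["health"]))
```

The learned coefficients can be printed per category, so the box is not entirely black: a positive weight means the keyword pushes an article toward that category, and a negative weight pushes it away. On real data, the validation error would guide the fine-tuning, for example by changing the number of epochs or the tokenizer.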

2018-01-03T18:21:38+00:00