What is Text Mining (commonly referred to as Text Analytics)

Text mining is a research technique that uses computational analysis to uncover patterns in large text-based data sets or large collections of written resources. It is useful in numerous scholarly fields, from the humanities, where it is one of the tools of digital humanities, to the sciences, where useful data can be mined from text databases of published literature. Source: UMass Amherst Libraries

Text mining comprises three main activities:

  • Information retrieval (IR) to gather relevant texts.
  • Information extraction (IE) to identify and extract entities, facts and relationships between them.
  • Data mining to find associations among the pieces of information extracted from many different texts.
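
The three activities above can be sketched as a toy pipeline. This is a minimal illustration under stated assumptions, not a production workflow: the documents are invented, a hypothetical fixed term list (KNOWN_TERMS) stands in for a real entity recognizer, and "association" is reduced to counting which terms appear together in the same document.

```python
import re
from collections import Counter
from itertools import combinations

# Toy document collection (invented for illustration).
DOCUMENTS = [
    "Aspirin reduces fever and relieves headache.",
    "Ibuprofen relieves headache and reduces inflammation.",
    "The library opens at nine on weekdays.",
    "Aspirin may reduce fever after injury.",
]

# Hypothetical term list standing in for a real entity recognizer.
KNOWN_TERMS = {"aspirin", "ibuprofen", "fever", "headache", "inflammation"}


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def retrieve(documents, query_terms):
    """Information retrieval: keep documents that mention any query term."""
    return [doc for doc in documents if query_terms & set(tokenize(doc))]


def extract_entities(document):
    """Information extraction: return the known terms found in one document."""
    return sorted(KNOWN_TERMS & set(tokenize(document)))


def mine_associations(entity_lists):
    """Data mining: count how often two entities share a document."""
    pair_counts = Counter()
    for entities in entity_lists:
        pair_counts.update(combinations(entities, 2))
    return pair_counts


relevant = retrieve(DOCUMENTS, {"aspirin", "ibuprofen"})
entities = [extract_entities(doc) for doc in relevant]
associations = mine_associations(entities)

for pair, count in associations.most_common(3):
    print(pair, count)
```

Real projects usually replace each step with a dedicated tool (a database search for retrieval, a trained model for extraction), but the flow from retrieval to extraction to association stays the same.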

In short, text mining can help make the implicit information in your documents more explicit. If you face reading a daunting number of documents to find some key information, you are very likely to benefit from text mining. Text mining helps you find trends in the literature, gain insight, identify key issues, and generally take the chore out of manual information extraction.


What is an API

An API (application programming interface) is a tool used to share content and data between software applications and can often be employed for text mining purposes.

APIs are used in a variety of contexts, but some examples include:

  • embedding content from one website into another
  • dynamically posting content from one application to display in another
  • extracting data from a database in a more programmatic way than a regular user interface might allow.

Many scholarly databases and products offer APIs that allow users with programming skills to extract data more powerfully for a variety of research purposes. Take a look at our API page to discover access points to licensed databases. APIs are also available for public resources, including Google Maps and social media sites such as Twitter.
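
As a rough sketch of what programmatic extraction can look like, the example below builds a search request and pulls abstracts out of a JSON response. Everything specific here is an assumption made for illustration: the endpoint (api.example.org), the parameter names (q, page, per_page), and the response layout (a "results" list with "abstract" fields) are hypothetical, and each real API documents its own. The sample response is hard-coded so the example runs without network access.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the base URL from your database's API docs.
BASE_URL = "https://api.example.org/v1/search"


def build_search_url(query, page=1, per_page=25):
    """Build a search URL; the parameter names are assumptions, not a real API."""
    params = {"q": query, "page": page, "per_page": per_page}
    return f"{BASE_URL}?{urlencode(params)}"


def extract_abstracts(response_text):
    """Return the non-empty abstracts from a JSON response body."""
    payload = json.loads(response_text)
    return [
        record["abstract"]
        for record in payload.get("results", [])
        if record.get("abstract")
    ]


# Invented response in the assumed layout, standing in for a real API reply.
SAMPLE_RESPONSE = json.dumps({
    "results": [
        {"id": 1, "abstract": "Topic models applied to newspaper archives."},
        {"id": 2, "abstract": ""},
        {"id": 3, "abstract": "Named entity recognition in clinical notes."},
    ]
})

url = build_search_url("text mining")
abstracts = extract_abstracts(SAMPLE_RESPONSE)

print(url)
print(abstracts)
```

In practice the response text would come from fetching the URL (for example with urllib.request.urlopen), and most licensed databases also require an API key and impose rate limits, so check the provider's terms before collecting data in bulk.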


From data to corpus

Some online databases can now provide trillions of words of textual data, covering huge historical and geographic spans. Researchers getting started in text mining may be tempted to collect "all the data" and expect text mining tools to produce useful results automatically. Unfortunately, data provided at this scale is rarely organized enough to produce meaningful results without substantial modification. Often, a carefully constructed sample of the data will yield more useful results than "all the data," which is likely to suffer from hidden biases and unpredictable gaps in coverage.
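
One common way to construct such a sample is stratified sampling: group the records by a property you care about (here, decade of publication) and draw the same number from each group, so that heavily digitized periods do not drown out sparse ones. The sketch below is only one possible approach, and the catalog, the field names (id, year), and the group size are invented for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, per_group, seed=0):
    """Draw up to per_group records from each group defined by key(record)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)

    sample = []
    for group_key in sorted(groups):
        members = groups[group_key]
        # Small groups contribute everything they have.
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


def decade(record):
    """Map a record to the decade of its publication year."""
    return record["year"] // 10 * 10


# Invented catalog: far more recent documents than old ones.
years = [1905] * 3 + [1955] * 10 + [2005] * 100
records = [{"id": i, "year": year} for i, year in enumerate(years)]

corpus = stratified_sample(records, key=decade, per_group=5)
print(Counter(decade(record) for record in corpus))
```

Taking "all the data" here would give a corpus that is almost 90% from the 2000s. The stratified sample is much more balanced, though it also exposes a gap: only three records exist for the 1900s, which is worth knowing before interpreting any results.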