CS410 Text Information Systems (Spring 2019)

Instructor: ChengXiang Zhai
Teaching Assistants: Bhavya, Assma Boughoula, Yan Feng, Xuan Wang
Time & Place: 9:30am--10:45am Mondays & Wednesdays, 1404 Siebel Center

Piazza Forum (click to join); Compass 2g Space (click to visit); Course Wiki


Note: This page provides basic information to help students decide whether they would be interested in taking the course. More up-to-date information about the course is available on the Course Piazza Forum.

About the Course

This course covers general computational techniques for managing and analyzing large amounts of text data, techniques that help users make use of text data in all kinds of applications. Text data include all data in the form of natural language text (e.g., English text or Chinese text): web pages, social media data such as tweets, news, scientific literature, emails, government documents, and many other kinds of enterprise data. Text data play an essential role in our lives. Since we communicate using natural languages, we produce and consume a large amount of text data (i.e., "big text data") every day on all kinds of topics. The explosive growth of text data makes it very difficult, or impossible, for people to consume all the relevant text data in a timely manner. Thus, there is an urgent need to develop intelligent text information systems that help people access, digest, and make use of all the relevant information they need quickly and accurately at any time.

Logically, to harness big text data, we would first need to identify the text data relevant to a particular application problem (i.e., perform information retrieval) and then analyze the identified text data in more depth to extract any knowledge needed for a task (i.e., perform text analysis/mining). The first step is usually supported by a search engine, while the second step is supported by various text analysis tools.

In this course, you will learn the underlying technologies of both search engines and text analysis tools. You will learn the basic concepts, principles, and major algorithms for managing, analyzing, and mining text data, and gain hands-on experience using information retrieval and text mining toolkits to experiment with algorithms and develop your own text information system applications. You will also have an opportunity to work on a course project on a topic of your choice related to the course materials. The course emphasizes basic principles and practically useful algorithms, especially general and robust algorithms that can be applied to any natural language text data. Topics to be covered include, among others, text analysis, information retrieval models, recommender systems, text categorization and clustering, topic mining and analysis, search engine evaluation, search engine design and implementation, and applications in Web search and mining.

Leveraging the lecture videos of two MOOCs on Coursera (i.e., Text Retrieval and Search Engines and Text Mining and Analytics), the course will be offered with a "blended classroom model." The class meetings will not be used for the instructor to deliver lectures, but instead to help students digest the content that they learn by watching lecture videos before a class meeting. The class meetings will also be used to help students finish assignments and course projects, and for other interactive activities that facilitate learning. There will be weekly quizzes given at class meetings and several assignments that involve a small amount of programming and experimentation with data sets. Grading is based on the quizzes, assignments, and a course project. Those who register for the course for 4 credit hours are required to finish a literature survey on a frontier topic.

Textbook

ChengXiang Zhai, Sean Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM and Morgan & Claypool Publishers, 2016. (click here to read the book online)

Prerequisites

Students should come with good programming skills. CS225 or CS400 or an equivalent course is required. Knowledge of basic probability and statistics is a plus. If you are not sure whether you have the right background, please contact the instructor.

Format

The course is lecture-based, but all the lectures are delivered via online videos. The lecture videos are available through two MOOCs on Coursera: 1) Text Retrieval and Search Engines: https://www.coursera.org/learn/text-retrieval; 2) Text Mining and Analytics: https://www.coursera.org/learn/text-mining. For convenience, the lecture videos are also all available via Compass.

The class will meet only once each week, with the other class meeting slot being used by the students to watch videos. Specifically, in each week, the students watch lecture videos in the middle of the week and submit a brief summary with questions or topics that they want to discuss at the first class meeting in the subsequent week. At that meeting, the instructor or TAs will answer the questions that haven't been answered on Piazza, review any difficult topics suggested by the students, or help students in other ways, such as helping them complete assignments or projects. Every week, at the first class meeting, there will be a short quiz to test the materials watched by the students nearly two weeks before (e.g., the quiz given in Week X would cover content watched by students in Week X-2). Once the material for a week has been covered in a quiz, it will not be covered again in any later quiz. There will be no exam.

The quizzes ensure that the students have a good mastery of all the essential concepts, principles, and algorithms. There are individual assignments (and possibly group assignments as appropriate), which often involve using a software toolkit to implement an algorithm and/or experiment with real text data. The assignments ensure that the students acquire practical skills in using existing toolkits to run experiments and build application tools. There will also be a course project, on which the students can work in a team. The project ensures that the students have an opportunity to synthesize multiple pieces of knowledge learned from the course and apply the learned knowledge and skills to solve an interesting real-world problem. Students taking the course for 4 credit hours also need to finish a literature review.

Course Policy and Grading

  1. Assignments
    The assignments are designed to ensure that every student has a deep and precise understanding of the major algorithms and gains hands-on experience with using a retrieval toolkit; thus, students are generally required to complete them independently unless it is a group assignment. Discussion with others is allowed, and indeed encouraged, to the extent that it helps in understanding the material. Piazza is a good place for such discussions. The purpose of student collaboration is to facilitate learning, not to circumvent it. The actual solution must be done by each student alone, and the student should be ready to reproduce their solution upon request. You must exercise academic integrity. See the University Policy on Academic Integrity, especially the section on plagiarism.
    Late submission of an assignment will result in a reduced grade for the assignment, unless an extension has been granted by the instructor. An assignment is worth at most 90% credit for the first 24 hours after the deadline. It is worth at most 75% credit for the following 24 hours. It is worth at most 50% credit after that, but within two weeks of the deadline. Except in exceptional cases, assignments will generally not be accepted more than two weeks after the due date, which means that if your assignment is turned in 14 days later than the due date, it will not be graded and you will receive zero credit for the assignment. If you need an extension, please ask for it by sending email to the instructor as soon as the need for it is known.

  2. The course project
    The purpose of the course project is twofold: (1) to give the students opportunities to apply what has been learned from the course to solve some real-world text information management and analysis problems; (2) to allow the students to explore new ideas and techniques for text information management and analysis by working on a real problem. Teamwork is allowed and encouraged. There will be a number of "instructor-designed" project topics available for you to choose from, but you are also very welcome, indeed encouraged, to come up with any interesting topic on your own. More guidelines will be available later.

  3. The extra "1 hour" literature review
    Every student who takes the course for 4 credit hours is required to finish a literature review on a topic in the scope of the course. The topic will be selected by the student with the approval of the instructor. More guidelines will be available later.

  4. Grading
    Grading will be based on the following weighting scheme: For students taking the course for 4 credit hours, if they complete the literature review satisfactorily, the weighting scheme will be applied in the same way as for those who are taking the course for 3 credit hours. If they fail to complete the literature review, their maximum grade will be 75 points (out of 100 points), and the weighting scheme above will be applied to the total 75 points (instead of 100 points). The letter grades are determined based on the following mapping:
    A+: [95, 100]
    A:  [90, 94]
    A-: [85, 89]
    B+: [80, 84]
    B:  [75, 79]
    B-: [70, 74]
    C:  [60, 69]
    D:  [55, 59]
    F:  <55
    
    Students are strongly encouraged to help each other by actively answering one another's questions on Piazza. The most active contributors on Piazza will receive up to 5 points of extra credit, which could move your grade up by one bracket.
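
As an informal illustration only (not an official grading tool), the late-submission tiers and the letter-grade mapping described above can be sketched in Python. The function names `max_credit` and `letter_grade` are hypothetical. The sketch also makes assumptions that the policy text does not settle: tier boundaries are treated as inclusive, a submission exactly 14 days late receives zero credit, fractional totals fall into the bracket whose lower bound they reach, and the total is capped at 100 points after extra credit.

```python
from datetime import timedelta

# Late-credit tiers from the policy text: (lateness limit, maximum credit).
# Assumption: each limit is inclusive (exactly 24 hours late still earns 90%).
LATE_TIERS = [
    (timedelta(hours=24), 0.90),
    (timedelta(hours=48), 0.75),
]
# Assumption: a submission exactly 14 days late already receives zero credit.
CUTOFF = timedelta(days=14)


def max_credit(lateness: timedelta) -> float:
    """Return the maximum fraction of credit for a given lateness."""
    if lateness <= timedelta(0):
        return 1.0  # on time or early
    if lateness >= CUTOFF:
        return 0.0  # not graded
    for limit, credit in LATE_TIERS:
        if lateness <= limit:
            return credit
    return 0.50  # more than 48 hours late, but within two weeks


# Lower bound of each bracket in the letter-grade mapping, highest first.
GRADE_FLOORS = [
    (95, "A+"), (90, "A"), (85, "A-"),
    (80, "B+"), (75, "B"), (70, "B-"),
    (60, "C"), (55, "D"),
]


def letter_grade(points: float, extra_credit: float = 0.0) -> str:
    """Map a point total (plus up to 5 extra-credit points) to a letter."""
    # Assumption: extra credit is capped at 5 and the total at 100.
    total = min(100.0, points + min(extra_credit, 5.0))
    for floor, letter in GRADE_FLOORS:
        if total >= floor:
            return letter
    return "F"
```

For example, under these assumptions `max_credit(timedelta(hours=30))` falls in the second tier, and `letter_grade(83, extra_credit=5)` shows how the Piazza extra credit can lift a total from the B+ bracket into the A- bracket.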