SOFUS A. MACSKASSY
http://www.research.rutgers.edu/~sofmac
sofmac@gmail.com



Education

Rutgers, The State University of New Jersey, New Brunswick, NJ
Ph.D., Computer Science, September 1996 - January 2003
Master of Science, Computer Science, September 1994 - May 1996
Bachelor of Arts, Computer Science, September 1988 - May 1992

Professional Experience


Information Sciences Institute, University of Southern California
Jan 2013 - present, Project Leader
Oct 2011 - Dec 2012, Sr. Computer Scientist
Researcher in social media analytics focused on personalized information management. I look at problems such as extracting and linking entities and concepts from user-generated content, learning about users based on their posts and reading behaviors, and trending and clustering of users and communities over time. I use techniques from statistical relational learning, machine learning, social network analysis, data mining, text mining, information extraction and integration, and record linkage.

Fetch Technologies
Oct 2008 - Oct 2011, Director, Fetch Labs
Oct 2007 - Oct 2008, Principal Scientist
Sep 2005 - Oct 2007, Senior Research Scientist
Leading and directing government- and industry-funded research at Fetch Technologies. This includes formulating research directions as well as pushing developed and appropriate technologies towards transition into Fetch products. Research areas include relational learning, record linkage, social network analysis, and graph mining.

Department of Computer Science, University of Southern California
Jan 2012 - present, Assistant Research Professor
Sep 2007 - Dec 2012, Assistant Adjunct Professor
Spring 2006, Lecturer
Teaching graduate-level artificial intelligence (CS-561) and machine learning (CS-567), as well as advanced machine learning seminars, on a semi-regular basis.

Stern Business School, New York University
Jan 2003 - Aug 2005, Associate Research Scientist
Studied classification and learning in networked data. This included an in-depth study of network-only classification, which uses only the class labels of related instances to estimate the class of a given instance (e.g., classifying research papers based on citation links when the category of only a few papers in the citation graph is known). Developed, as part of this study, an open-source Network Learning Toolkit (NetKit). Other research included applying network learning techniques in various domains, active acquisition of secondary data to improve performance, and methods for generating confidence bounds on ROC curves.

Information Architects, 70 Hudson St, 2nd floor, Hoboken, NJ.
2/1999 - 12/2000, Internet Technologist
Chief Architect and Designer for an agent framework for the web as well as an event- and messaging-driven communication model. The agent framework, available as part of the SmartCode product and built entirely in Java, uses an event- and messaging-driven model and includes work on distributed computing at the HTTP, FTP and SMTP protocol levels. This framework empowers applications to track resources easily and transparently with a minimal amount of CPU and network traffic. No spidering is involved unless strictly necessary.

Center for Computer Aids for Industrial Productivity (CAIP), Rutgers University
9/1992 - 7/1994, System Programmer III
Developed and maintained a beta release of an Inter-Process Communication (IPC) package between Unix and Macintosh systems using the AppleEvent (AE) protocol. The package was developed using the MPW and ThinkC environments on the Macintosh. Compared three different environments (Prograph, SmallTalk, and SmallTalk Agents, which was beta-tested) and advised on which would be best suited to the research group. Particular attention was paid to ease of use and the extensiveness of libraries for graphics and math.


Research Experience


Relational Learning, Stern Business School, New York University
Jan 2003 - Aug 2005
Worked on baseline methods, such as the Relational Neighbor classifier (RN), to which relational learners should be compared when assessing how well they have extracted a useful model from the given relational structure.
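
As a rough illustration of this kind of baseline, the sketch below estimates a node's class from the weighted labels of its neighbors, in the spirit of the Relational Neighbor classifier. It is a minimal sketch and not the NetKit implementation; the function name, the graph representation (a dict of weighted adjacency dicts), and the example paper IDs are all assumptions made for illustration.

```python
from collections import Counter


def relational_neighbor(graph, labels, node):
    """Estimate class probabilities for `node` from its labeled neighbors.

    graph:  {node: {neighbor: edge_weight}}  (hypothetical representation)
    labels: {node: class_label} for the nodes whose class is known
    """
    counts = Counter()
    for neighbor, weight in graph.get(node, {}).items():
        if neighbor in labels:
            # Each labeled neighbor votes for its class with the edge weight.
            counts[labels[neighbor]] += weight
    total = sum(counts.values())
    if total == 0:
        return {}  # no labeled neighbors, so no estimate
    return {cls: w / total for cls, w in counts.items()}


# Hypothetical citation graph: paper p4 cites three labeled papers.
graph = {"p4": {"p1": 1.0, "p2": 1.0, "p3": 1.0}}
labels = {"p1": "ML", "p2": "ML", "p3": "DB"}
probs = relational_neighbor(graph, labels, "p4")
```

Here `probs` gives "ML" two thirds of the weight and "DB" one third, using nothing but the link structure and the known labels.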

Information Triage, Rutgers University, PI: Haym Hirsh
Fall 2000 - Dec 2002
Introduced a new framework for applying machine learning techniques to rank information based on user interest and multiple information sources. The framework incorporates novel techniques for capturing a user's interest, a learning methodology for acquiring a complementary user model, and a new technique for analyzing the user model for user comprehensibility.

Information Valets, Rutgers University, PI: Haym Hirsh
Winter 1998 - Fall 2000
Worked on techniques for unsupervised learning of user interest using relevance feedback in a variety of domains. Initial work has focused on creating a generic framework, the Information Valet Framework, to work with multiple wireless devices and multiple information sources. The EmailValet was the first instantiation of this work. The EmailValet learns to predict whether to forward a new email message to a user's pager based on past email reading behavior of the user on the pager.

Text Classification with Numerical Data, Rutgers University, PI: Haym Hirsh
Winter 1998 - Fall 2001
Developed a new technique for incorporating numerical features into text classification systems (e.g., vector-space models that use tokens and have no knowledge of numbers; the Naive Bayes classifier and TF-IDF vector-based methods are two such systems). The technique converts numerical features into sets of tokens, using a method much like the "Thermometer Coding" representation, so that close numbers have much overlap in their sets while distant numbers have less. We have shown that by using this method, standard text classifiers can perform comparably to numerical methods on purely numerical datasets.
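
The token-overlap idea can be sketched as follows. This is only an illustration under stated assumptions, not the method from the published papers: the thresholds are fixed by hand here, and the function names and token format are hypothetical.

```python
from bisect import bisect_right


def number_to_tokens(name, value, thresholds):
    """Encode a number as a set of tokens, one token per threshold.

    Each threshold contributes either a "greater-or-equal" token or a
    "less-than" token, so nearby values share most of their tokens.
    """
    cuts = sorted(thresholds)
    k = bisect_right(cuts, value)  # number of thresholds <= value
    above = {f"{name}_ge_{t}" for t in cuts[:k]}
    below = {f"{name}_lt_{t}" for t in cuts[k:]}
    return above | below


def overlap(a, b):
    """Number of tokens two encodings have in common."""
    return len(a & b)


# Hand-picked thresholds (an assumption for this example).
thresholds = [10, 20, 30, 40, 50]
t12 = number_to_tokens("age", 12, thresholds)
t25 = number_to_tokens("age", 25, thresholds)
t47 = number_to_tokens("age", 47, thresholds)
```

With these thresholds, 12 and 25 share four of their five tokens while 12 and 47 share only two, so a token-based classifier sees the closer pair as more similar.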


Publications
2012
  • Sofus A. Macskassy (2012). Characterizing Retweeting Behaviors in Twitter: On the use of Text vs. Concepts. Workshop on Collective Learning and Inference on Structured Data (CoLISD), at ECML/PKDD 2012. [pdf]
  • Sofus A. Macskassy (2012). Mining Dynamic Networks: The Importance of Pre-processing on Downstream Analytics. The Second International Workshop on Mining Communities and People Recommenders (COMMPER), at ECML/PKDD 2012. [pdf]
  • Sofus A. Macskassy (2012). On the Study of Social Interactions in Twitter. Proceedings of the Sixth International Conference on Weblogs and Social Media (ICWSM), 2012. [pdf]
2011
  • Sofus A. Macskassy (2011). Relational Classifiers in a Non-relational world: Using Homophily to Create Relations. The Tenth International Conference on Machine Learning and Applications, 2011. [pdf]
  • Steve Minton, Matthew Michelson, Kane See, Sofus A. Macskassy, Bora C. Gazen, and Lise Getoor (2011). Improving Classifier Performance by Autonomously Collecting Background Knowledge from the Web. The Tenth International Conference on Machine Learning and Applications, 2011. [pdf]
  • Sofus A. Macskassy (2011). Contextual Linking Behavior of Bloggers: Leveraging text-mining to enable topic-based analysis. In Social Network Analysis and Mining, Volume 1, Number 4, 355-375. The official published paper is available online at http://www.springerlink.com/openurl.asp?genre=article&id=doi:10.1007/s13278-011-0026-8. DOI:10.1007/s13278-011-0026-8. [pdf]
  • Steve Minton, Sofus A. Macskassy, Peter LaMonica, Kane See, Craig A. Knoblock, Greg Barish, Matthew Michelson and Raymond Liuzzi (2011). Monitoring Entities in an Uncertain World: Entity Resolution and Referential Integrity. In the Twenty-Third Annual Conference on Innovative Applications of Artificial Intelligence (IAAI), 2011. [pdf]
  • Sofus A. Macskassy and Matthew Michelson (2011). Why do People Retweet? Anti-Homophily Wins the Day! In the Fifth International Conference on Weblogs and Social Media (ICWSM), 2011. [pdf]
  • Matthew Michelson and Sofus A. Macskassy (2011). What Blogs Tell Us about Websites: A Demographic Study. In the Proceedings of the Fourth ACM International Conference in Web Search and Data Mining (WSDM), Hong Kong, 2011. [pdf]
  • Matthew Michelson, Sofus A. Macskassy, Steve Minton and Lise Getoor (2011). Materializing Multi-Relational Databases from the Web using Taxonomic Queries. In the Proceedings of the Fourth ACM International Conference in Web Search and Data Mining (WSDM), Hong Kong, 2011. [pdf]
2010
  • Matthew Michelson and Sofus A. Macskassy (2010). Discovering Users' Topics of Interest on Twitter: A First Look. Proceedings of the Workshop on Analytics for Noisy, Unstructured Text Data (AND), Toronto, Canada, 2010. Toolkit (EntityExplorer) available here. [pdf]
  • Sofus A. Macskassy and Matthew Michelson (2010). Linking in Social Media Does Not a Community Make. Proceedings of the Workshop on Information in Networks (WIN-2010). [pdf]
  • Sofus A. Macskassy (2010). Leveraging contextual information to explore posting and linking behaviors of bloggers. Proceedings of the 2010 International Conference on Advances in Social Networks Analysis and Mining (ASONAM-2010). [pdf]
  • Matthew Michelson, Sofus A. Macskassy and Steve Minton (2010). Mixed-Initiative, Entity-Centric Data Aggregation using Assistopedia. Proceedings of the AAAI Workshop on Collaboratively-built Knowledge Sources and Artificial Intelligence (WikiAI), Atlanta, GA, 2010. [pdf]
  • Matthew Michelson and Sofus A. Macskassy (2010). An Efficient Sequential Covering Algorithm for Explaining Subsets of Data. Proceedings of the 2010 International Conference on Artificial Intelligence (ICAI). [pdf]
2009
  • Sofus A. Macskassy (2009). The many faces of guilt-by-association. Proceedings of the Workshop on Information in Networks (WIN-2009). [pdf]
  • Sofus A. Macskassy (2009). Using Graph-based Metrics with Empirical Risk Minimization to Speed Up Active Learning on Networked Data. Proceedings of the 15th ACM SIGKDD Conference On Knowledge Discovery and Data Mining, 2009. [pdf]
  • Matthew Michelson and Sofus A. Macskassy (2009). Layered, Multivariate Anomaly Explanations: A First Look. Proceedings of the International Workshop on Statistical Relational Learning, (SRL-2009). [pdf]
  • Shefali Sharma and Sofus A. Macskassy (2009). Ranking Techniques for Cluster Based Search Results in a Textual Knowledge-base. Proceedings of the 2009 International Conference on Artificial Intelligence (ICAI). [pdf]
  • Matthew Michelson and Sofus A. Macskassy (2009). Judging the Performance of Cascading Models: A First Look. Proceedings of the Fourth workshop on evaluation methods in machine learning (2009). [pdf]
  • Matthew Michelson and Sofus A. Macskassy (2009). Record Linkage Measures in an Entity Centric World. Proceedings of the Fourth workshop on evaluation methods in machine learning (2009). [pdf]
  • Matthew Michelson, Sofus A. Macskassy and Steven N. Minton (2009). Flexible query formulation for federated search. Proceedings of the Seventh International Workshop on Information Integration on the Web (IIWeb 2009). [pdf]
2008
  • Sofus A. Macskassy and Claude C. Nanjo (2008). Graph Mining using Graph Pattern Profiles. Proceedings of the 2008 International Conference on Artificial Intelligence (ICAI). [pdf]
  • Sofus A. Macskassy and Evan S. Gamble (2008). Data Mining in the Context of Entity Resolution. Workshop on Data Mining for Business Applications at the 14th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD). [pdf]
  • Paul Tetlock, Maytal Saar-Tsechansky, and Sofus A. Macskassy (2008). More Than Words: Quantifying Language to Measure Firms' Fundamentals. Journal of Finance, 63(3), pages 1437-1467, June 2008. [pdf]
2007
  • Sofus A. Macskassy (2007). Improving Learning in Networked Data by Combining Explicit and Mined Links. Proceedings of the Twenty-Second Conference on Artificial Intelligence (AAAI-2007), July 22-26, 2007, Vancouver, Canada. [ps] [pdf]
  • Sofus A. Macskassy (2007). Improving Within-Network Classification with Local Attributes. Workshop on Text-Mining and Link Analysis (Textlink) at the Twentieth International Joint Conference on Artificial Intelligence, January 7, 2007, Hyderabad, India. [ps] [pdf]
  • Evan S. Gamble, Sofus A. Macskassy, Steve Minton (2007). Classification with Pedigree and its Applicability to Record Linkage. Workshop on Text-Mining and Link Analysis (Textlink) at the Twentieth International Joint Conference on Artificial Intelligence, January 7, 2007, Hyderabad, India. [ps] [pdf]
  • Sofus A. Macskassy, Foster Provost (2007). Classification in Networked Data: A toolkit and a univariate case study. Journal of Machine Learning Research, 8(May):935-983, 2007. (This is the journal version of the CeDER-04-08 technical report below.)
    Data files used in this paper (formatted for the latest version of Netkit): NetKit-Data.zip (1.3Mb).
    [pdf]
2006
  • Sofus A. Macskassy, and Foster Provost (2006). A brief survey of machine learning methods for classification in networked data and an application to suspicion scoring. E.M. Airoldi et al. (Eds.): ICML 2006 Ws, LNCS 4503, pp. 172-175. Springer-Verlag. [pdf].
    Originally appeared as a poster at the Workshop on Statistical Network Learning at 23rd International Conference on Machine Learning (ICML 2006), Pittsburgh, 29 June, 2006.
    [pdf]
2005
  • Sofus A. Macskassy, Foster Provost, and Saharon Rosset (2005). ROC Confidence Bands: An Empirical Evaluation. In Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), Bonn, Germany, 7-11 August, 2005. (this also will appear in Proceedings of the Second workshop on ROC Analysis in ML, at ICML-2005). [ps] [pdf]
  • Sofus A. Macskassy, Foster Provost, and Saharon Rosset (2005). Pointwise ROC Confidence Bounds: An Empirical Evaluation. In Proceedings of the Second workshop on ROC Analysis in ML, at the 22nd International Conference on Machine Learning. [ps] [pdf]
  • Sofus A. Macskassy, Foster Provost (2005). Suspicion scoring based on guilt-by-association, collective inference, and focused data access. In Proceedings of the NAACSOS Conference 2005. June 2005. (this is a follow-up paper to the International Conference on Intelligence Analysis paper below with the similar title). [ps] [pdf]
  • Sofus A. Macskassy, Foster Provost (2005). NetKit-SRL: A Toolkit for Network Learning and Inference. In Proceedings of the NAACSOS Conference 2005. June 2005. [ps] [pdf]
  • Sofus A. Macskassy, Foster Provost (2005). Suspicion scoring based on guilt-by-association, collective inference, and focused data access. In Proceedings of the International Conference on Intelligence Analysis. May 2005. [ps] [pdf]
2004
  • Sofus A. Macskassy (2004). Significance Testing against the Random Model for Scoring Models on Top k Predictions. CeDER Working Paper #CeDER-05-09, Stern School of Business, New York University, NY, NY 10012. December 2004. [ps] [pdf]
  • Sofus A. Macskassy, Foster Provost (2004). Classification in Networked Data: A toolkit and a univariate case study. CeDER Working Paper #CeDER-04-08, Stern School of Business, New York University, NY, NY 10012. December 2004. Updated December 2006. This is the technical report version of the JMLR paper above. [ps] [pdf]
  • Sofus A. Macskassy, Foster Provost (2004). Confidence Bands for ROC Curves: Methods and an Empirical Study. In Proceedings of the First Workshop on ROC Analysis in AI (ROCAI-2004) at ECAI-2004. August 2004. [ps] [pdf]
  • Sofus A. Macskassy, Foster Provost (2004). Simple Models and Classification in Networked Data. CeDER Working Paper 03-04, Stern School of Business, New York University, NY, NY 10012. 2004. [ps] [pdf]
2003
  • Sofus A. Macskassy, Haym Hirsh (2003). Adding Numbers to Text Classification. In Proceedings of the Twelfth International Conference on Information and Knowledge Management (CIKM 2003). [ps] [pdf]
  • Foster Provost, Claudia Perlich, and Sofus A. Macskassy (2003). Relational Learning Problems and Simple Models. In Proceedings of the IJCAI-2003 Workshop on Learning Statistical Models from Relational Data.
  • Sofus A. Macskassy, Foster Provost (2003). A Simple Relational Classifier. In the 2nd Workshop on Multi-Relational Data Mining (MRDM-2003) at KDD-2003. [ps] [pdf]
  • Claudia Perlich, Foster Provost, and Sofus A. Macskassy (2003). Predicting citation rates for physics papers: Constructing features for an ordered probit model. In SIGKDD Explorations 5(2), 2003, 154-155.
  • Sofus A. Macskassy, Haym Hirsh, Foster Provost (2003). Intelligent Information Filtering: Learning Prospective User Profiles. Invited talk at the Joint Statistical Meeting topic contributed session on "Know Your Customer: User Profiling for CRM and Intrusion/Fraud Detection". [Compressed powerpoint slides].
  • Sofus A. Macskassy, Foster Provost, Michael L. Littman (2003). Confidence Bands for ROC Curves. CeDER Working Paper IS-03-04, Stern School of Business, New York University, NY, NY 10012. [ps] [pdf]
  • Sofus A. Macskassy (2003). New Techniques in Information Filtering. Ph.D. dissertation. Department of Computer Science, Rutgers University, New Brunswick, NJ. 2003. [ps] [pdf]
  • Sofus A. Macskassy, Haym Hirsh, Arunava Banerjee and Aynur A. Dayanik (2003). Converting Numerical Classification into Text Classification. Artificial Intelligence, 143(1):51-77, January 2003. [ps] [pdf]
2001
  • Sofus A. Macskassy, Haym Hirsh, Foster Provost, Ramesh Sankaranarayanan and Vasant Dhar (2001). Intelligent Information Triage. © ACM, 2001. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in 24th Annual International Conference on Research and Development in Information Retrieval (SIGIR-2001), pages 318-326. (http://doi.acm.org/10.1145/383952.384015) [ps] [pdf]
  • Sofus A. Macskassy, Haym Hirsh, Arunava Banerjee and Aynur A. Dayanik (2001). Using Text Classifiers for Numerical Classification. Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-2001). [ps] [pdf]
  • Sofus A. Macskassy, Haym Hirsh, Foster Provost, Ramesh Sankaranarayanan and Vasant Dhar (2001). Information Triage using Prospective Criteria. 8th International Conference on User Modeling (UM-2001) Workshop on Machine Learning, Information Retrieval and User Modeling. [zipped PowerPoint slides from the talk]. [ps] [pdf]
  • Sofus A. Macskassy (2001). Intelligent Information Triage: Learning to Prioritize by Integrating Multiple Sources of Information. A Student Scholarship Poster in the Eighteenth International Conference on Machine Learning (ICML-2001). [zipped PowerPoint postscript slides].
2000
  • Sofus A. Macskassy, Aynur A. Dayanik and Haym Hirsh (2000). Information Valets for Intelligent Information Access. AAAI Spring Symposia Series on Adaptive User Interfaces, (AUI-2000). [ps] [pdf]
1999
  • Sofus A. Macskassy, Aynur A. Dayanik and Haym Hirsh (1999). EmailValet: Learning User Preferences for Wireless Email. IJCAI-99 workshops: Learning About Users and Machine Learning for Information Filtering, 1999.
    Slides for the talk are available in powerpoint as ijcai1999-slides.ppt (162Kb, needs to be able to view EPS files) and as compressed postscript ijcai1999-slides.ps.gz (85Kb).
    [ps] [pdf]
  • Sofus A. Macskassy, Aynur A. Dayanik and Haym Hirsh (1999). EmailValet: Learning Email Preferences for Wireless Platforms. Seventh International Conference on User Modeling workshop Machine Learning for User Modeling, (UM-1999). [ps] [pdf]
1998
  • Sofus A. Macskassy, Arunava Banerjee, Brian D. Davison and Haym Hirsh (1998). Human Performance on Clustering Web Pages: A Preliminary Study. Poster at The Fourth International Conference on Knowledge Discovery and Data Mining, (KDD-1998).
    (A longer version is available as a technical report DCS-TR-355.)
    [ps] [pdf]
  • Sofus A. Macskassy, Arunava Banerjee, Brian D. Davison and Haym Hirsh (1998). Human Performance on Clustering Web Pages. Technical Report, DCS-TR-355, Department of Computer Science, Rutgers University, August 1998.
    (A shorter version appeared as a poster in The Fourth International Conference on Knowledge Discovery and Data Mining.)
    [ps] [pdf]
1997
  • Sofus A. Macskassy and Leon Shklar (1997). Maintaining information resources. Proceedings of the Third International Workshop on Next Generation Information Technologies (NGITS'97), June 30-July 3, 1997, Neve Ilan, Israel. [ps] [pdf]

    Instructional Experience


    Department of Computer Science, University of Southern California, Los Angeles, CA

    Department of Computer Science, Rutgers University, New Brunswick, NJ
    Courses:

    • Second-semester undergraduate course in data structures
      Instructor (Fall 1996, Spring 1997)
      Teaching Assistant (Spring 1995, Spring 1996)
    • Graduate course in data structures
      Teaching Assistant (Fall 1995)
    • Introductory course to Computers for non-CS majors
      Teaching Assistant (Fall 1994)


    Other Work Experience


    Pencom Web Works, 40 Fulton St., New York, NY
    9/1997 - 2/1999, Web Developer
    Chief Architect and Designer for a prototype web-agent framework. Ran the initial performance experiments for proof of concept. Started the design of the next generation of the framework, which was later released commercially at Information Architects.

    Ward Six Entertainment
    Spring 1997 - Spring 2000, AI/Interface Programmer
    Created a flexible Non-Deterministic Finite State Machine simulator, a dialogue framework with a UI, as well as various utility classes, all developed in C++.


    Skills
    Strengths: Problem-solving and analysis, data structures, algorithms, design (any level), quick thinking, communication, programming, self-motivated, work well in teams, easy-going.
    OS: Linux, BSD-Unix, SunOS 4.x, Solaris, Windows 9x/NT/2K
    Languages: Java (1.x, Servlets, JNDI, RMI, J2ME), JSP, C, C++, Perl, Ant, sed, awk, shell, csh, JavaScript, HTML, UML
    Familiar with: XML, RDF, CSS, ASP, Pascal, Fortran, LISP, Prolog, SmallTalk
    Protocols: HTTP, FTP, POP, TCP/IP, CGI, SMTP
    Other: Tomcat, Apache web server, MySQL, VisualCafe, Visio, Visual C++ 6.0, Rainbow Package

    Naive Bayes, Winnow, SVM, C4.5, ID3, Ripper, Bagging, Boosting.
    Certifications: Sun-Certified Java 2 Programmer.

    Responsibilities


    Selected Talks

    • Invited talk, "Mining Social Media: The Importance of Combining Network and Content," The 8th International Conference on Data Mining (DMIN), July 2012.
    • Invited talk, "Social Media Analytics: links, user generated content, and more.", Information Sciences Institute, USC, February 2011.
    • Invited talk, "Social Media Analytics: links, user generated content, and more.", Rutgers University, September 2010.
    • Invited talk/workshop, "Efficient Machine Learning on Large Networks by Leveraging Homophily," Networks and Network Analysis for the Humanities Workshop at UCLA, August 2010.
    • Invited speaker, "Learning with Networked Data," Navy Research Labs, May 2010.
    • Invited speaker, "Efficient Machine Learning on Large Networks by Leveraging Homophily," Lawrence Livermore National Labs, November 2009.
    • Invited speaker, "Semi-supervised Learning in the context of networked data," University of Washington, April 2008.
    • Invited speaker, "Semi-supervised Learning in the context of networked data," University of California, Irvine, October 2007.
    • Invited speaker, "Improving learning in networked data by combining explicit and mined links," NASA Ames Research Center, August 2007.
    • Invited speaker, "NetKit-SRL: A Toolkit for Network Learning and Inference," USC Information Sciences Institute AI Seminar Series, November 2005.
    • Invited speaker, "NetKit-SRL: A Toolkit for Network Learning and Inference," Google Mountain View, November 2005.