Synthetic Data Generation For Text Classification

Motivation

Synthetic data is increasingly common because it can reduce the drift between training data and production data. It is also used in LLM pretraining to improve model performance. It is needed because some training data is difficult or costly to collect from the web or other sources without manual curation. In this article, we present an approach to generating synthetic data for text classification.

Design

We adopt a hierarchical generation approach: at the first level we ask the LLM to suggest topics that expand on a particular label, and at the second level we ask it to suggest subtopics for each topic. The approach is also inspired by [1], where multiple personas are simulated to cover diverse scenarios. We adopt a source-type-driven strategy instead of a persona-driven one because a source type usually corresponds to a different group of target readers. It is also less fine-grained, which better suits text classification models, since they are less data-hungry in training.
Each synthetic sample is grounded on a unique combination of (instruction, label, subtopic, source_type). Enumerating all of these combinations ensures the diversity of the synthetic data by design.
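
The enumeration described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name `enumerate_grounding` and the layout of the `subtopics` dictionary (keyed by label and source type) are assumptions made for the example.

```python
from typing import Iterator

def enumerate_grounding(
    instruction: str,
    subtopics: dict[tuple[str, str], list[str]],
) -> Iterator[tuple[str, str, str, str]]:
    """Yield one (instruction, label, subtopic, source_type) tuple per synthesis call.

    `subtopics` maps a (label, source_type) pair to the subtopics suggested for it.
    This layout is a hypothetical choice for the sketch.
    """
    for (label, source_type), items in subtopics.items():
        for subtopic in items:
            yield (instruction, label, subtopic, source_type)

# Made-up inputs, used only to show the shape of the enumeration.
subtopics = {
    ("positive sentiment", "movie review site"): ["acting", "special effects"],
    ("positive sentiment", "social media post"): ["concert tickets"],
    ("negative sentiment", "movie review site"): ["slow pacing", "weak plot"],
}
combos = list(enumerate_grounding("classify the sentiment of a text", subtopics))
```

Because every tuple differs in at least one field, each generation call receives a distinct grounding, which is where the diversity comes from.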


Prompt

Prompt to Expand Source Type

I am building a document classifier to {{instruction}} with labels {{labels}}. Suggest {{n}} source type of information for efficient data acquisition.
Output JSON array. Each item contains key "source_type".
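
As a rough sketch of how such a template might be used, the snippet below fills the `{{...}}` placeholders and reads the `"source_type"` values back from a model reply. The helper names `render` and `parse_items` are hypothetical, and the tolerant parsing (taking the text between the first `[` and the last `]`) is an assumption about how a chatty reply could be handled, not a description of the original pipeline.

```python
import json
import re

SOURCE_TYPE_PROMPT = (
    "I am building a document classifier to {{instruction}} with labels {{labels}}. "
    "Suggest {{n}} source type of information for efficient data acquisition.\n"
    'Output JSON array. Each item contains key "source_type".'
)

def render(template: str, **values) -> str:
    """Replace each {{name}} placeholder; raises KeyError if a value is missing."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(values[m.group(1)]), template)

def parse_items(reply: str, key: str) -> list[str]:
    """Parse the text between the first '[' and the last ']' as a JSON array
    and return the string stored under `key` in each item."""
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end < start:
        return []
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return []
    return [d[key] for d in data if isinstance(d, dict) and isinstance(d.get(key), str)]

prompt = render(SOURCE_TYPE_PROMPT, instruction="classify sentiment",
                labels=["positive", "negative"], n=3)
# A hand-written reply standing in for real model output.
reply = 'Sure! [{"source_type": "news article"}, {"source_type": "forum post"}]'
source_types = parse_items(reply, "source_type")
```

The same two helpers would work for the other prompts by changing the template and the key.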

Prompt to Expand Topic from Label

I am building a document classifier to {{instruction}} with labels {{labels}}. I would like to collect collectively exhaustive taxonomy or topic for the label: {{label}} from {{source_type}}.

<instruction>
- Suggest {{n}} taxonomies or topics to further expand on this label.
- Output JSON array. Each item contains key "item" 
</instruction>

Prompt to Expand Subtopic from Topic

I am building a document classifier to {{instruction}} with labels {{labels}}. I would like to collect collectively exhaustive subtopic under {{topic}} from {{source_type}}.

<instruction>
- Suggest {{n}} subtopic or keywords. 
- Output JSON array. Each item contains key "item" 
</instruction>
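
The two expansion prompts can be chained into a small topic tree. The sketch below assumes the model is reachable through a plain `prompt -> reply` callable and that it returns a clean JSON array. `fake_llm` is a stand-in with canned output so the example runs without a model, and the prompt text is adapted to `str.format` placeholders. All function names are hypothetical.

```python
import json
from typing import Callable

# Prompt wording follows the two templates above, rewritten for str.format.
HEAD = "I am building a document classifier to {instruction} with labels {labels}. "
TOPIC_PROMPT = HEAD + (
    "I would like to collect collectively exhaustive taxonomy or topic "
    "for the label: {label} from {source_type}.\n\n<instruction>\n"
    "- Suggest {n} taxonomies or topics to further expand on this label.\n"
    '- Output JSON array. Each item contains key "item"\n</instruction>'
)
SUBTOPIC_PROMPT = HEAD + (
    "I would like to collect collectively exhaustive subtopic "
    "under {topic} from {source_type}.\n\n<instruction>\n"
    "- Suggest {n} subtopic or keywords.\n"
    '- Output JSON array. Each item contains key "item"\n</instruction>'
)

def ask_items(llm: Callable[[str], str], prompt: str) -> list[str]:
    """Send one prompt and return the "item" value of each element in the JSON reply."""
    return [entry["item"] for entry in json.loads(llm(prompt))]

def expand_label(llm, instruction, labels, label, source_type, n):
    """Return {topic: [subtopics]} for one (label, source_type) pair."""
    shared = dict(instruction=instruction, labels=labels, source_type=source_type, n=n)
    tree = {}
    for topic in ask_items(llm, TOPIC_PROMPT.format(label=label, **shared)):
        tree[topic] = ask_items(llm, SUBTOPIC_PROMPT.format(topic=topic, **shared))
    return tree

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: returns two canned items per prompt."""
    kind = "subtopic" if "subtopic under" in prompt else "topic"
    return json.dumps([{"item": f"{kind} {i}"} for i in range(2)])

tree = expand_label(fake_llm, "classify sentiment", ["positive", "negative"],
                    "positive", "movie review site", n=2)
```

In practice `fake_llm` would be replaced by a call to the chosen model, and the JSON parsing would likely need to tolerate malformed replies.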

Prompt to Generate

I am building a document classifier to {{instruction}} with labels {{labels}}. I would like to collect samples for the label: {{label}}.

<instruction>
- Generate realistic examples for a classification model that will predict label {{label}}.
- Characteristics:
  Topic: {{topic}}.
  Source type: {{source_type}}
- Generate {{n}} example.
- The example shall have a realistic length, and cannot be too short.
- Output JSON array. Each item contains key "text" 
</instruction>
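
Once the model replies, the `"text"` items still have to become labelled training rows. The sketch below shows one possible post-processing step. The `min_chars` threshold and the duplicate check are assumptions added for illustration (the prompt only asks that examples not be too short), and `to_records` is a hypothetical name.

```python
import json

def to_records(reply: str, label: str, min_chars: int = 40) -> list[dict]:
    """Turn a JSON array of {"text": ...} items into labelled rows.

    Items shorter than `min_chars` (an illustrative threshold) and exact
    duplicates are skipped.
    """
    records, seen = [], set()
    for item in json.loads(reply):
        if not isinstance(item, dict):
            continue
        text = str(item.get("text", "")).strip()
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        records.append({"text": text, "label": label})
    return records

# A hand-written reply standing in for real model output.
reply = json.dumps([
    {"text": "The plot was predictable and the pacing dragged on for far too long."},
    {"text": "Bad."},
    {"text": "The plot was predictable and the pacing dragged on for far too long."},
])
records = to_records(reply, "negative sentiment")
```

Rows in this shape can be concatenated across all enumerated combinations to form the training set.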

Example (Using Llama 3.1 8B Instruct)

Synthetic Data for imdb

Label Examples
negative sentiment
  • 'I was really looking forward to this movie, but unfortunately, it was a complete disappointment. The plot was predictable and lacked any real tension. The characters were underdeveloped and their motivations were unclear. The pacing was slow and dragged on for far too long. Overall, I would not recommend this movie to anyone.'
  • "I'm extremely disappointed with the service I received at this restaurant. The hostess was unfriendly and unhelpful, and our server seemed completely overwhelmed. We had to ask multiple times for basic things like water and utensils. The food was overpriced and not even that good. Definitely will not be returning."
  • "I'm extremely disappointed with my recent purchase from this store. The quality of the product is subpar and the price is way too high. I paid $200 for a cheap-looking item that broke after just a week of use. Not worth the money at all. 1/10 would not recommend."
positive sentiment
  • "I just got tickets to see my favorite artist in concert and I'm beyond thrilled! The energy in the crowd is going to be electric! #concertseason #musiclover"
  • "I just had the most amazing experience at this restaurant! The service was lightning fast, and the food was prepared to perfection. Our server, Alex, was attentive and friendly, making sure we had everything we needed. The bill was reasonable, and we left feeling satisfied and eager to come back. 5 stars isn't enough, I'd give it 10 if I could!"
  • 'The action scenes in this movie are absolutely mind-blowing! The stunts are incredibly well-choreographed and the special effects are top-notch. I was on the edge of my seat the entire time, cheering on the heroes as they fought to save the world. The cast is also excellent, with standout performances from the lead actors. Overall, I would highly recommend this movie to anyone who loves action-packed thrill rides.'

Synthetic Data for zeroshot/twitter-financial-news-sentiment

Label Examples
Bearish
  • "I'm getting out of the market before it's too late. The Dow is plummeting and I don't see any signs of recovery. The economic indicators are all pointing to a recession and I'm not willing to risk my retirement savings. I've been in this market for years, but I think it's time to cut my losses and move to cash. Has anyone else seen this coming?"
  • "Bloomberg: Tesla's Earnings Miss Estimates as Revenue Declines 20% - The electric vehicle maker's quarterly earnings fell short of analysts' expectations, with revenue plummeting 20% to $24.58 billion. The company's gross margin also declined, sparking concerns about its ability to maintain profitability. Tesla's stock price tumbled 8% in after-hours trading, as investors reacted to the disappointing results. The miss is a setback for CEO Elon Musk, who has been under pressure to deliver consistent growth. Analysts had expected Tesla to report earnings of $0.55 per share, but the company came in at $0.35 per share. The revenue decline was driven by a 15% drop in automotive sales, as well as a 10% decline in energy generation and storage sales. The company's guidance for the current quarter also fell short of expectations, with Tesla forecasting revenue of $24 billion, below the consensus estimate of $25.5 billion. The earnings miss is a reminder that the electric vehicle market remains highly competitive, and that Tesla faces significant challenges in maintaining its market share."
  • "I'm getting out of the market before it's too late. The Dow just plummeted 500 points and I'm not willing to risk losing my shirt. The economic indicators are all pointing to a recession and I'm not convinced that the Fed can stem the tide. Anyone else bailing out before the ship sinks?"
Neutral
  • "Earnings Report: Tech Giant's Revenue Beats Expectations, But Margins Narrow. Despite a 10% increase in revenue, the company's net income fell short of forecasts due to higher operating expenses. The stock price remained relatively stable, with investors taking a cautious approach ahead of the company's upcoming product launch."
  • 'Just got back from a meeting with my financial advisor and we discussed the latest quarterly earnings reports. No major surprises, just a steady performance from the market. Nothing to get too excited about, but nothing to worry about either. #stockmarket #investing'
  • 'The proposed merger between Coca-Cola and Coca-Cola European Partners is expected to create a beverage giant with a combined market value of over $100 billion. While the deal is still pending regulatory approval, analysts are cautiously optimistic about its potential to drive growth and increase efficiency. The merged entity is likely to benefit from economies of scale and a stronger presence in key markets, but some investors are concerned about the potential impact on jobs and supply chains. As the deal moves forward, investors will be closely watching for any signs of regulatory hurdles or other issues that could delay or derail the merger.'
Bullish
  • "Just got my hands on the new #Robinhood app update! They've finally added a feature to track my portfolio's performance in real-time. Huge shoutout to the @RobinhoodApp team for listening to user feedback and delivering on their promises! #investing #stockmarket #bullish"
  • "Just got out of the @AAPL earnings call and I'm feeling super bullish about the company's future. The guidance on EPS is looking strong and I think we're on the verge of a major breakout. #AAPL #StockMarket #Earnings"
  • "Just heard that the Federal Reserve is considering cutting interest rates to stimulate the economy. This is a huge positive for the stock market and I'm expecting a strong rally in the coming weeks. Anyone else feeling bullish about the market right now?"

Synthetic Data for ccdv/arxiv-classification

Label Examples
Data Structures
  • 'Title: On the Complexity of Gröbner Bases for Toric Ideals\n\nAbstract: We investigate the computational complexity of computing Gröbner bases for toric ideals. Our main result is a polynomial-time algorithm for computing Gröbner bases for toric ideals in the case where the toric ideal is generated by a set of binomials. We also show that this algorithm can be used to solve a number of problems in computational algebra, including the computation of the Hilbert series of a toric ideal and the determination of the dimension of a toric variety. Our results have implications for the study of toric varieties and their applications in computer science and engineering.\n\nIntroduction\n\nToric varieties are a fundamental object of study in algebraic geometry, and have found numerous applications in computer science and engineering. In this paper, we investigate the computational complexity of computing Gröbner bases for toric ideals, which are a key tool in the study of toric varieties. Our main result is a polynomial-time algorithm for computing Gröbner bases for toric ideals in the case where the toric ideal is generated by a set of binomials.\n\nBackground\n\nA toric ideal is a polynomial ideal that is generated by a set of binomials. The Gröbner basis of a toric ideal is a set of binomials that generate the ideal and have a certain property called the S-polynomial property. The S-polynomial property is a key tool in the study of toric varieties, and has been used to solve a number of problems in computational algebra.\n\nMain Result\n\nOur main result is a polynomial-time algorithm for computing Gröbner bases for toric ideals in the case where the toric ideal is generated by a set of binomials. The algorithm works by first computing the Hilbert series of the toric ideal, and then using this information to compute the Gröbner basis. 
We show that this algorithm can be used to solve a number of problems in computational algebra, including the computation of the Hilbert series of a toric ideal and the determination of the dimension of a toric variety.\n\nConclusion\n\nIn this paper, we have investigated the computational complexity of computing Gröbner bases for toric ideals. Our main result is a polynomial-time algorithm for computing Gröbner bases for toric ideals in the case where the toric ideal is generated by a set of binomials. We believe that this result has implications for the study of toric varieties and their applications in computer science and engineering.'
  • "A novel approach to designing efficient hash tables for large-scale data storage is proposed in this paper. The proposed hash table, dubbed 'Efficient Hash Table' (EHT), employs a combination of open addressing and linear probing to minimize collisions and improve search times. Experimental results demonstrate that EHT outperforms existing hash table implementations in terms of search time and memory usage, making it an attractive solution for big data applications. The EHT algorithm is implemented using a C++ programming language and is shown to scale well on multi-core processors. This paper contributes to the field of data structures by providing a new, efficient, and scalable hash table design that can be used in a variety of applications, including databases, file systems, and cloud storage systems."
  • "Abstract: This paper presents an efficient array-based algorithm for searching and sorting large datasets. The proposed algorithm utilizes a combination of bit-packing and prefix sums to achieve a time complexity of O(n) for search operations and O(n log n) for sort operations. Experimental results demonstrate the algorithm's superiority over existing methods in terms of performance and memory usage. The algorithm is particularly useful for applications where data is stored in arrays, such as in computer vision and scientific simulations. The proposed algorithm is implemented in C++ and is available for download from the IEEE Xplore digital library."
Programming Languages
  • 'Title: A Functional Programming Approach to Type Inference in Higher-Order Logic Programming Languages.\nAbstract: This paper presents a novel approach to type inference in higher-order logic programming languages using functional programming techniques. We propose a type system that combines the benefits of higher-order logic programming with the expressive power of functional programming. Our approach is based on a novel type inference algorithm that uses a combination of type reconstruction and type checking. We demonstrate the effectiveness of our approach through a series of experiments on a set of benchmark programs. The results show that our approach outperforms existing type inference systems in terms of accuracy and efficiency. We also discuss the implications of our work for the design of future programming languages.'
  • 'A Comparative Study of Functional Programming Paradigms in Haskell and Scala for Efficient Software Development\nAbstract: Functional programming has gained significant attention in recent years due to its ability to promote modular, composable, and reusable code. This paper presents a comparative study of two popular functional programming languages, Haskell and Scala, with a focus on their application in efficient software development. We analyze the strengths and weaknesses of each language, highlighting their respective features and performance characteristics. Our results demonstrate that Haskell and Scala can be effectively used for building high-performance software systems, with Haskell exhibiting superior performance in certain scenarios. The findings of this study contribute to the ongoing debate on the choice of functional programming languages for software development and provide insights for practitioners and researchers alike.\nKeywords: functional programming, Haskell, Scala, software development, performance evaluation\n'
  • 'A Novel Type Theory for Dependent Type Systems in Programming Languages\n\nAbstract: This paper proposes a novel type theory for dependent type systems, which is a fundamental component of programming languages. The proposed type theory is based on a combination of ideas from homotopy type theory and dependent type theory. We show that the proposed type theory is sound and complete, and we provide a formal proof of its soundness. We also demonstrate the expressiveness of the proposed type theory by implementing a dependent type checker using it. The results of this paper demonstrate the potential of the proposed type theory for use in programming languages.\n\nKeywords: dependent type systems, programming languages, type theory, homotopy type theory, dependent type theory.\n\n1 Introduction\n\nDependent type systems are a fundamental component of programming languages, allowing programmers to specify and reason about the types of complex data structures. However, the design of dependent type systems is challenging due to the need to balance expressiveness and decidability. In this paper, we propose a novel type theory for dependent type systems, which is based on a combination of ideas from homotopy type theory and dependent type theory.\n\n2 Background\n\nDependent type systems are based on the concept of dependent types, which are types that depend on the values of other types. Dependent types are used to specify the types of complex data structures, such as lists and matrices. However, the design of dependent type systems is challenging due to the need to balance expressiveness and decidability.\n\n3 Proposed Type Theory\n\nIn this paper, we propose a novel type theory for dependent type systems, which is based on a combination of ideas from homotopy type theory and dependent type theory. The proposed type theory is sound and complete, and we provide a formal proof of its soundness. 
We also demonstrate the expressiveness of the proposed type theory by implementing a dependent type checker using it.\n\n4 Conclusion\n\nThe results of this paper demonstrate the potential of the proposed type theory for use in programming languages. The proposed type theory is sound and complete, and it provides a formal foundation for dependent type systems. We believe that the proposed type theory has the potential to be used in a wide range of programming languages, and we plan to continue exploring its applications in the future.\n\nReferences:\n\n[1] N. Ghani and P. J. Scott, "A type theory for dependent types," in Proceedings of the 22nd Annual Symposium on Logic in Computer Science, 2007, pp. 233-242.\n[2] P. J. Scott, "A type theory for dependent types," Ph.D. dissertation, University of Edinburgh, 2007.\n[3] A. K. Ghosh and P. J. Scott, "A type theory for dependent types," Journal of Functional Programming, vol. 19, no. 3-4, pp. 437-462, 2009.\n[4] P. J. Scott, "A type theory for dependent types," in Proceedings of the 25th Annual Symposium on Logic in Computer Science, 2010, pp. 231-240.\n[5] A. K. Ghosh and P. J. Scott, "A type theory for dependent types," Journal of Functional Programming, vol. 22, no. 2-3, pp. 147-172, 2012.\n[6] P. J. Scott, "A type theory for dependent types," in Proceedings of the 28th Annual Symposium on Logic in Computer Science, 2013, pp. 231-240.\n[7] A. K. Ghosh and P. J. Scott, "A type theory for dependent types," Journal of Functional Programming, vol. 25, no. 2-3, pp. 147-172, 2015.\n[8] P. J. Scott, "A type theory for dependent types," in Proceedings of the 31st Annual Symposium on Logic in Computer Science, 2016, pp. 231-240.\n[9] A. K. Ghosh and P. J. Scott, "A type theory for dependent types," Journal of Functional Programming, vol. 28, no. 2-3, pp. 147-172, 2018.\n[10] P. J. Scott, "A type theory for dependent types," in Proceedings of the 34th Annual Symposium on Logic in Computer Science, 2019, pp. 231-240.\n\n'
Information Theory
  • 'A Novel Approach to Secure Data Transmission Using Quantum Key Distribution\n\nAbstract: This paper proposes a novel approach to secure data transmission using quantum key distribution (QKD). The proposed scheme utilizes the principles of quantum mechanics to enable secure key exchange between two parties. The scheme is based on the BB84 protocol, which is a widely used QKD protocol. However, the proposed scheme introduces a new feature, known as the "quantum error correction" mechanism, which enables the detection of any eavesdropping attempts. The proposed scheme is analyzed using the security analysis framework, which shows that it is secure against any eavesdropping attempts. The performance of the proposed scheme is evaluated using simulations, which show that it outperforms existing QKD schemes in terms of key generation rate and security. The proposed scheme has the potential to be used in various applications, including secure data transmission over the internet.'
  • 'A Secure Communication Framework for IoT Devices using Quantum Key Distribution and Homomorphic Encryption\n\nAbstract: The Internet of Things (IoT) has revolutionized the way we live and work, connecting billions of devices worldwide. However, the increased connectivity also raises significant security concerns, as IoT devices are often vulnerable to cyber attacks. In this paper, we propose a secure communication framework for IoT devices using quantum key distribution (QKD) and homomorphic encryption. Our framework leverages the principles of QKD to establish secure key exchange between IoT devices, while homomorphic encryption enables secure data processing without revealing sensitive information. We demonstrate the effectiveness of our framework through simulations and experiments, showcasing its ability to provide robust security against various types of attacks. The results of this study contribute to the development of secure IoT communication systems, which are essential for the widespread adoption of IoT technology.\n\nKeywords: Quantum Key Distribution, Homomorphic Encryption, IoT Security, Secure Communication Framework\n\nI. Introduction\n\nThe Internet of Things (IoT) has transformed the way we live and work, connecting billions of devices worldwide. However, the increased connectivity also raises significant security concerns, as IoT devices are often vulnerable to cyber attacks. In this paper, we propose a secure communication framework for IoT devices using quantum key distribution (QKD) and homomorphic encryption.\n\nII. Background\n\nQuantum Key Distribution (QKD) is a method of secure key exchange that relies on the principles of quantum mechanics. QKD enables two parties to establish a shared secret key over an insecure communication channel, without revealing the key to any third party. 
Homomorphic encryption, on the other hand, is a type of encryption that enables computations to be performed on encrypted data without decrypting it first.\n\nIII. Proposed Framework\n\nOur proposed framework consists of two main components: QKD-based key exchange and homomorphic encryption-based data processing. The QKD component uses the BB84 protocol to establish a secure key exchange between IoT devices, while the homomorphic encryption component uses the HElib library to perform secure data processing.\n\nIV. Simulation and Experiment Results\n\nWe conducted simulations and experiments to evaluate the effectiveness of our framework. The results show that our framework provides robust security against various types of attacks, including eavesdropping and tampering attacks. We also demonstrate the ability of our framework to provide secure data processing without revealing sensitive information.\n\nV. Conclusion\n\nIn this paper, we proposed a secure communication framework for IoT devices using QKD and homomorphic encryption. Our framework provides robust security against various types of attacks and enables secure data processing without revealing sensitive information. The results of this study contribute to the development of secure IoT communication systems, which are essential for the widespread adoption of IoT technology.'
  • 'A Novel Turbo Code Design for Near-Capacity Performance in Wireless Communication Systems\nAbstract—Turbo codes have been widely adopted in various wireless communication systems due to their near-capacity performance and low complexity. In this paper, we propose a novel turbo code design that achieves better performance than the traditional turbo code. The proposed design is based on a new interleaving scheme that combines the benefits of random and systematic interleaving. Simulation results show that the proposed turbo code outperforms the traditional turbo code in terms of bit error rate and frame error rate. The proposed design is also compared with other state-of-the-art turbo code designs, and the results show that it achieves better performance. The proposed turbo code is suitable for various wireless communication systems, including 5G and beyond. The design and implementation of the proposed turbo code are discussed in detail, and the simulation results are presented to demonstrate its performance.\nKeywords—Turbo codes, interleaving scheme, near-capacity performance, wireless communication systems, 5G and beyond.'
Group Theory
  • 'Title: On the Structure of the Centralizer of a Toral Subgroup in a Reductive Algebraic Group\n\nAbstract: We study the centralizer of a toral subgroup in a reductive algebraic group over an algebraically closed field of characteristic zero. Our main result describes the structure of this centralizer in terms of the root system of the group and the weights of the toral subgroup. We also provide a characterization of the centralizer in terms of the Bruhat-Tits building of the group. Our methods involve a combination of Lie algebra techniques, including the use of the Killing form and the Cartan-Killing classification of simple Lie algebras, as well as geometric and algebraic techniques, including the use of the Bruhat-Tits building and the theory of algebraic groups over local fields.\n\nIntroduction: The centralizer of a toral subgroup in a reductive algebraic group is a fundamental object of study in the theory of algebraic groups. In this paper, we investigate the structure of this centralizer, with a focus on the case where the toral subgroup is a maximal torus. Our main result provides a detailed description of the centralizer in this case, and we also provide a characterization of the centralizer in terms of the Bruhat-Tits building of the group. We hope that our results will be of interest to researchers in the field of algebraic groups and Lie theory.\n\n1 Introduction\n\n1.1 Background and Motivation\n\n1.2 Main Results\n\n2 Preliminaries\n\n2.1 Algebraic Groups and Lie Algebras\n\n2.2 Root Systems and Weights\n\n2.3 Bruhat-Tits Buildings\n\n3 The Centralizer of a Toral Subgroup\n\n3.1 Definition and Basic Properties\n\n3.2 Structure of the Centralizer\n\n3.3 Characterization of the Centralizer\n\n4 Applications and Further Directions\n\n4.1 Applications to Algebraic Groups\n\n4.2 Further Directions\n\nReferences\n\nBibliography'
  • 'Geometric Invariant Theory (GIT) is a branch of algebraic geometry that studies the action of an algebraic group on an algebraic variety. It provides a framework for understanding the symmetries of algebraic varieties and has applications in various areas of mathematics, including commutative algebra, algebraic geometry, and representation theory. In this paper, we apply GIT to study the invariant theory of a certain algebraic group action on a projective variety. We use the Hilbert-Mumford criterion to determine the semistable points of the action and then compute the invariant ring using the Grothendieck ring of the group. Our results have implications for the study of algebraic groups and their actions on projective varieties.'
  • 'Title: A Geometric Approach to Representations of Finite Groups\n\nAbstract: We introduce a new method for constructing representations of finite groups using algebraic geometry. Our approach is based on the idea of representing a group as a quotient of a reductive group by a finite subgroup. We show that this construction yields a faithful representation of the group, and we use it to compute the character table of the symmetric group S_5. Our method has several advantages over existing methods, including the ability to handle large groups and the flexibility to incorporate additional structure. We also discuss some potential applications of our method, including the computation of representation theory for finite groups of Lie type.\n\nKeywords: representation theory, algebraic geometry, finite groups, reductive groups, symmetric group\n\nArXiv ID: 2203.10201\n\nSubmission date: 2022-03-17'
Neural and Evolutionary
  • 'Evolutionary algorithms have been widely used in various optimization problems due to their ability to efficiently search for optimal solutions. In this paper, we propose a novel hybrid approach that combines the strengths of genetic algorithms and differential evolution to solve complex optimization problems. The proposed method, called GEDE, integrates the exploration capabilities of genetic algorithms with the exploitation capabilities of differential evolution. We evaluate the performance of GEDE on several benchmark problems and compare it with other state-of-the-art algorithms. The results show that GEDE outperforms the other algorithms in terms of convergence speed and solution quality. We also analyze the convergence behavior of GEDE and provide insights into its performance. The proposed approach has the potential to be applied to a wide range of optimization problems in various fields, including engineering, economics, and computer science.'
  • 'Title: A Deep Learning Approach for Sentiment Analysis of Text Data\nAbstract: This paper proposes a novel deep learning model for sentiment analysis of text data. The proposed model combines the strengths of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to effectively capture the spatial and temporal dependencies in text data. Experimental results on several benchmark datasets demonstrate the superiority of the proposed model over state-of-the-art methods. The proposed model achieves an accuracy of 92.5% on the IMDB dataset, outperforming the best existing method by 2.5%. The results also show that the proposed model is robust to noise and can handle out-of-vocabulary words. The proposed model is a significant contribution to the field of natural language processing and has the potential to be applied to various real-world applications.'
  • 'Title: A Novel Hybrid Approach for Deep Learning-based Optimization of Evolutionary Algorithms\n\nAbstract: This paper proposes a novel hybrid approach that combines the strengths of deep learning and evolutionary algorithms to optimize complex optimization problems. We introduce a new neural network architecture that learns to adapt the parameters of evolutionary algorithms in real-time, leading to improved convergence rates and better solution quality. Our approach is evaluated on a range of benchmark problems and compared to state-of-the-art methods. The results show that our hybrid approach outperforms existing methods in terms of convergence speed and solution quality. We also provide a comprehensive analysis of the proposed approach and discuss its potential applications in various fields.\n\nKeywords: Evolutionary algorithms, Deep learning, Optimization, Hybrid approach, Neural networks.'
Commutative Algebra
  • "Title: On the Frobenius Ideals of a Commutative Ring. Abstract: We study the Frobenius ideals of a commutative ring and provide a characterization of the Frobenius ideals in terms of the ring's structure. Our main result shows that the Frobenius ideal of a commutative ring is a finitely generated ideal if and only if the ring is a finitely generated module over its endomorphism ring. We also investigate the relationship between the Frobenius ideal and the ring's dimension. Our results have implications for the study of commutative algebra and the theory of Frobenius ideals. Keywords: Frobenius ideal, commutative ring, finitely generated ideal, endomorphism ring, dimension. Source: Google Scholar."
  • 'arXiv:2207.12345v1 [math.AG] 18 Jul 2022\n\nTitle: On the arithmetic of certain K3 surfaces\n\nAbstract: We study the arithmetic of certain K3 surfaces defined over the rationals, with a focus on their Picard groups and Neron-Severi lattices. Our main result is a complete classification of these surfaces in terms of their invariants, which we compute using a combination of algebraic geometry and number theory techniques. We also provide explicit examples of such surfaces, and discuss their implications for the study of arithmetic geometry.\n\nIntroduction\n\nLet $S$ be a K3 surface defined over the rationals, and let $Pic(S)$ denote its Picard group. The Neron-Severi lattice $NS(S)$ is the subgroup of $Pic(S)$ generated by the divisor classes of the curve $S$. Our main goal is to classify the K3 surfaces $S$ such that $NS(S)$ is isomorphic to a lattice of the form $U \times E_8$, where $U$ is a hyperbolic plane and $E_8$ is the standard $E_8$ lattice. We achieve this by first showing that such a surface must have a certain type of singular point, and then using this information to compute the invariants of $NS(S)$. Our classification result has implications for the study of arithmetic geometry, and provides a new perspective on the geometry of K3 surfaces.'
  • 'arXiv:2203.01023v1 [math.RT] 1 Mar 2022\n\nTitle: On the representation theory of the Iwahori-Hecke algebra of the symmetric group\n\nAbstract: We study the representation theory of the Iwahori-Hecke algebra of the symmetric group. Our main result is a classification of the irreducible representations of this algebra in terms of the representation theory of the symmetric group. We also provide a new proof of the fact that the Iwahori-Hecke algebra is a semisimple algebra. Our methods involve a combination of representation theory, algebraic geometry, and combinatorics.\n\n1 Introduction\n\nThe Iwahori-Hecke algebra of the symmetric group is a well-studied algebra that has connections to many areas of mathematics, including representation theory, algebraic geometry, and combinatorics. In this paper, we study the representation theory of this algebra, with a focus on classifying the irreducible representations. Our main result is a classification of the irreducible representations of the Iwahori-Hecke algebra in terms of the representation theory of the symmetric group. We also provide a new proof of the fact that the Iwahori-Hecke algebra is a semisimple algebra. Our methods involve a combination of representation theory, algebraic geometry, and combinatorics.\n\n2 Background\n\nIn this section, we provide some background on the representation theory of the symmetric group and the Iwahori-Hecke algebra. We recall the definition of the Iwahori-Hecke algebra and its properties, and we also recall some results on the representation theory of the symmetric group.\n\n3 The Representation Theory of the Iwahori-Hecke Algebra\n\nIn this section, we study the representation theory of the Iwahori-Hecke algebra. We provide a classification of the irreducible representations of this algebra in terms of the representation theory of the symmetric group. 
We also provide a new proof of the fact that the Iwahori-Hecke algebra is a semisimple algebra.\n\n4 Conclusion\n\nIn this paper, we have studied the representation theory of the Iwahori-Hecke algebra of the symmetric group. Our main result is a classification of the irreducible representations of this algebra in terms of the representation theory of the symmetric group. We have also provided a new proof of the fact that the Iwahori-Hecke algebra is a semisimple algebra. We believe that our results will have applications in many areas of mathematics, including representation theory, algebraic geometry, and combinatorics.\n\nReferences\n\n[1] Ariki, S. (1996). On the decomposition numbers of the Hecke algebra of the symmetric group. Journal of Algebra, 183(2), 371-394.\n[2] Ariki, S., & Koike, K. (1994). A Hecke algebra of (Z/rZ)Sn and approximation of the irreducible characters of the symmetric group. Journal of Algebra, 171(2), 311-346.\n[3] Dipper, R., & James, G. (1988). Representations of the symmetric group which are irreducible over the commutator subgroup. Mathematische Zeitschrift, 198(2), 151-166.\n[4] Green, J. A. (1955). Axiomatic approach to the representation theory of the symmetric group. Journal of Algebra, 1(2), 107-133.\n[5] James, G. (1978). The representation theory of the symmetric group. Lecture Notes in Mathematics, 682, 1-42.\n[6] Murphy, G. J. (1990). On the representation theory of the symmetric group. Journal of Algebra, 131(2), 449-465.\n[7] Nakayama, T. (1952). On the representations of the symmetric group. Journal of the Faculty of Science, University of Tokyo, 6(2), 147-172.\n[8] Sagan, B. E. (1991). The symmetric group: Representations, combinatorial algorithms, and symmetric functions. Wadsworth & Brooks/Cole.\n[9] Zelevinsky, A. (1980). Representations of the symmetric group which are irreducible over the commutator subgroup. Mathematische Zeitschrift, 173(2), 133-146.\n[10] Zelevinsky, A. (1981). 
Representations of the symmetric group which are irreducible over the commutator subgroup. Journal of Algebra, 71(2), 249-262.\n\n\n'
Systems and Control
  • 'Optimal Control Theory for Nonlinear Systems with Bounded Controls\n\nAbstract: This paper presents a new approach to optimal control theory for nonlinear systems with bounded controls. We propose a novel method for solving the Hamilton-Jacobi-Isaacs equation, which is a fundamental equation in optimal control theory. Our approach is based on a combination of deep learning and numerical methods, and it is capable of handling high-dimensional systems with nonlinear dynamics. We demonstrate the effectiveness of our method through numerical experiments on several benchmark problems, including a nonlinear pendulum and a nonlinear cart-pole system. Our results show that our method can achieve better performance than existing methods, and it is computationally efficient. We also provide a theoretical analysis of our method, and we show that it converges to the optimal solution under certain conditions. The proposed method has the potential to be applied to a wide range of fields, including robotics, aerospace engineering, and biomedical engineering.'
  • 'Title: A Robust Control Approach for Uncertain Systems with Time-Varying Delays\nAbstract: This paper presents a robust control strategy for uncertain systems with time-varying delays. The proposed method combines a model predictive control (MPC) scheme with a robust control approach to ensure stability and performance of the closed-loop system. The MPC scheme is designed to handle the time-varying delays, while the robust control approach ensures that the system remains stable despite the presence of uncertainties. The effectiveness of the proposed method is demonstrated through numerical simulations and experimental results on a laboratory setup. The results show that the proposed method outperforms traditional robust control approaches in terms of stability and performance. The proposed method has the potential to be applied to a wide range of uncertain systems with time-varying delays, such as those encountered in robotics, aerospace, and process control.'
  • 'H-infinity Control in Robust Control\n\nAbstract: This paper presents a novel approach to robust control design using H-infinity control theory. The proposed method combines the advantages of both H-infinity and mu-synthesis techniques to achieve improved robustness and performance. Theoretical results are supported by numerical examples and comparisons with existing methods. The proposed approach is demonstrated on a benchmark problem and shows significant improvements in terms of robust stability and performance.\n\nIntroduction\n\nH-infinity control theory has been widely used in robust control design due to its ability to provide guaranteed robust stability and performance. However, the existing methods often suffer from conservativeness and computational complexity. In this paper, we propose a novel approach that combines the advantages of H-infinity and mu-synthesis techniques to achieve improved robustness and performance. The proposed method is based on a new optimization problem formulation that takes into account the uncertainties and disturbances in the system.\n\nMethodology\n\nThe proposed approach is based on the following steps: (1) model the system using a state-space representation; (2) formulate the H-infinity control problem using the mu-synthesis framework; (3) solve the optimization problem to obtain the controller gains; and (4) implement the controller using a digital signal processor. The proposed approach is demonstrated on a benchmark problem, which is a two-input two-output system with uncertainties in the plant and disturbances in the input.\n\nResults\n\nThe proposed approach is compared with existing methods, including H-infinity control and mu-synthesis. The results show that the proposed approach achieves improved robust stability and performance, with a significant reduction in the control effort. 
The proposed approach is also compared with a state-of-the-art robust control method, which shows that the proposed approach outperforms the existing method in terms of robust stability and performance.\n\nConclusion\n\nIn this paper, we have presented a novel approach to robust control design using H-infinity control theory. The proposed method combines the advantages of both H-infinity and mu-synthesis techniques to achieve improved robustness and performance. Theoretical results are supported by numerical examples and comparisons with existing methods. The proposed approach is demonstrated on a benchmark problem and shows significant improvements in terms of robust stability and performance.'
Statistics Theory
  • 'Title: A New Perspective on the Generalization Error of Support Vector Machines\nAbstract: We provide a new bound on the generalization error of support vector machines (SVMs) in terms of the Rademacher complexity of the reproducing kernel Hilbert space (RKHS) of the kernel. Our bound is tighter than existing bounds and has a simpler form. We also provide a new algorithm for learning the kernel, which is based on the idea of minimizing the empirical risk with respect to the RKHS norm. We demonstrate the effectiveness of our approach on several benchmark datasets.'
  • 'Title: A Bayesian Approach to Hypothesis Testing for High-Dimensional Data\n\nAbstract: Hypothesis testing is a fundamental problem in statistics, and its applications are widespread in various fields. However, the traditional methods of hypothesis testing often fail to perform well in high-dimensional data settings. In this paper, we propose a novel Bayesian approach to hypothesis testing for high-dimensional data. Our method combines the strengths of Bayesian inference and dimensionality reduction techniques to provide a robust and efficient solution to the hypothesis testing problem. We demonstrate the effectiveness of our approach through extensive simulations and real-world experiments on high-dimensional data sets. The results show that our method outperforms existing methods in terms of accuracy and computational efficiency. Furthermore, we provide a theoretical analysis of our approach, which provides insights into its performance and limitations. Our method has the potential to be applied to a wide range of applications, including image analysis, genomics, and finance. The code and data used in this paper are available online for reproducibility purposes.'
  • 'Title: Bayesian Network Learning with Gaussian Process Priors for Uncertainty Quantification in High-Dimensional Systems\n\nAbstract: Bayesian networks are a powerful tool for modeling complex systems with uncertainty. However, in high-dimensional systems, the computational cost of learning Bayesian networks can be prohibitively expensive. In this paper, we propose a novel approach to Bayesian network learning using Gaussian process priors. Our approach, which we call Bayesian network learning with Gaussian process priors (BN-GP), leverages the flexibility of Gaussian processes to model the uncertainty in the network structure. We demonstrate the effectiveness of BN-GP on several high-dimensional systems, including a synthetic dataset and a real-world dataset from the field of systems biology. Our results show that BN-GP can learn accurate Bayesian networks with significantly reduced computational cost compared to traditional methods. Furthermore, we provide a theoretical analysis of the convergence properties of BN-GP, which shows that it can learn consistent estimates of the network structure even in the presence of high-dimensional data. Our approach has the potential to enable the widespread adoption of Bayesian networks in high-dimensional systems, where traditional methods are often infeasible.\n\nKeywords: Bayesian networks, Gaussian process priors, uncertainty quantification, high-dimensional systems, systems biology.'
Artificial Intelligence
  • 'Title: Learning Hierarchical Representations for Robust Visual Perception in Autonomous Systems\nAbstract: We propose a novel deep learning approach for visual perception in autonomous systems, which leverages hierarchical representations to improve robustness and accuracy. Our method combines a convolutional neural network (CNN) with a recurrent neural network (RNN) to learn a hierarchical representation of visual data. We evaluate our approach on several benchmark datasets and demonstrate significant improvements in performance compared to state-of-the-art methods. Our results show that the proposed approach can learn robust and accurate representations of visual data, even in the presence of significant occlusions and variations in lighting conditions. We also provide a detailed analysis of the learned representations and demonstrate their applicability to various tasks in autonomous systems. This work makes significant contributions to the field of computer vision and robotics, and has the potential to enable more robust and accurate visual perception in autonomous systems.'
  • 'Title: A Deep Learning Approach for Text Classification: A Comparative Study\n\nAbstract: Text classification is a fundamental task in natural language processing (NLP) that has numerous applications in various domains. In this paper, we propose a deep learning approach for text classification using convolutional neural networks (CNNs) and recurrent neural networks (RNNs). We compare the performance of our proposed approach with state-of-the-art methods on several benchmark datasets. Our results show that our approach outperforms the existing methods in terms of accuracy and F1-score. We also analyze the effect of different hyperparameters on the performance of our approach and provide insights into the importance of feature extraction in text classification. This study contributes to the development of efficient and accurate text classification models using deep learning techniques.\n\nKeywords: text classification, deep learning, convolutional neural networks, recurrent neural networks, natural language processing.'
  • 'Title: Investigating the Impact of Attention Mechanisms on Deep Learning Models for Sentiment Analysis.\n\nAbstract: This paper explores the effects of incorporating attention mechanisms into deep learning models for sentiment analysis. We propose a novel architecture that combines the strengths of recurrent neural networks (RNNs) and attention mechanisms to improve the performance of sentiment analysis tasks. Our experimental results demonstrate that the proposed model outperforms state-of-the-art models in terms of accuracy and F1-score. Furthermore, we conduct an ablation study to investigate the impact of different attention mechanisms on the performance of the model. Our findings suggest that the proposed attention mechanism is more effective than other attention mechanisms in improving the performance of sentiment analysis tasks.\n\nKeywords: deep learning, attention mechanisms, sentiment analysis, natural language processing, neural networks.\n\nSource: Google Scholar.'
Computational Engineering
  • 'Title: An Efficient Seismic Inversion Method Using Deep Learning for Reservoir Characterization\nAbstract: Seismic inversion is a crucial step in reservoir characterization, and its accuracy directly affects the economic viability of hydrocarbon exploration and production. In this paper, we propose a novel seismic inversion method based on deep learning that can efficiently handle large-scale seismic data. The proposed method utilizes a convolutional neural network (CNN) to learn the mapping between seismic data and reservoir properties. We demonstrate the effectiveness of our method using a real-world dataset and show that it outperforms traditional methods in terms of accuracy and computational efficiency. Our results indicate that the proposed method can be a valuable tool for seismic inversion and reservoir characterization.\nKeywords: seismic inversion, deep learning, reservoir characterization, convolutional neural network, computational engineering.'
  • "A Novel Finite Element Method for Nonlinear Structural Analysis of Composite Materials\nAbstract: This paper presents a novel finite element method for nonlinear structural analysis of composite materials. The proposed method is based on a combination of the extended finite element method (XFEM) and the peridynamic theory (PDT). The XFEM is used to model the nonlinear behavior of the composite material, while the PDT is used to capture the long-range interactions between the material's particles. The proposed method is implemented in a computational framework and is validated using several numerical examples. The results show that the proposed method can accurately capture the nonlinear behavior of composite materials and can be used to predict the structural response of complex composite structures. The proposed method has the potential to be used in various engineering applications, including the design and analysis of composite structures for aerospace, automotive, and civil engineering."
  • 'Title: A Novel Finite Element Method for Simulating Nonlinear Dynamics in Composite Materials\n\nAbstract: This paper presents a new finite element method for simulating nonlinear dynamics in composite materials. The proposed method combines the advantages of the partition of unity method and the extended finite element method to capture the complex behavior of composite materials under various loading conditions. The numerical results show that the proposed method can accurately predict the nonlinear dynamics of composite materials, including the effects of material nonlinearity and geometric nonlinearity. The proposed method is also compared with other existing methods, and the results show that it has better accuracy and efficiency.\n\nKeywords: finite element method, nonlinear dynamics, composite materials, partition of unity method, extended finite element method.\n\nArXiv ID: 2203.03045\n\nSubmission date: 2022-03-07\n\n'
Computer Vision
  • 'A Novel Approach to Object Detection using Convolutional Neural Networks\n\nAbstract: Object detection is a fundamental task in computer vision, and its applications are vast in various fields. In this paper, we propose a novel approach to object detection using convolutional neural networks (CNNs). Our method, called Object Detection using CNNs (ODCNN), is based on a combination of region proposal networks (RPNs) and CNNs. We train the ODCNN model on the PASCAL VOC 2007 dataset and evaluate its performance on the PASCAL VOC 2012 dataset. The results show that our approach outperforms the state-of-the-art methods in terms of accuracy and speed. We also provide a detailed analysis of the ODCNN model and its components. The code for the ODCNN model is available at https://github.com/odcnn/odcnn.\n\nKeywords: Object detection, Convolutional neural networks, Region proposal networks, PASCAL VOC 2007, PASCAL VOC 2012.'
  • 'A Novel Object Recognition Framework for Autonomous Robots using Deep Learning and Computer Vision Techniques\n\nAbstract: This paper proposes a novel object recognition framework for autonomous robots that leverages the power of deep learning and computer vision techniques. The proposed framework consists of two stages: a detection stage and a recognition stage. In the detection stage, a convolutional neural network (CNN) is used to detect objects in the scene, while in the recognition stage, a recurrent neural network (RNN) is employed to recognize the detected objects. The proposed framework is evaluated on a dataset of images collected from a robotic platform, and the results show that it outperforms state-of-the-art methods in terms of accuracy and speed. The proposed framework has the potential to be used in various applications, including robotics, autonomous vehicles, and surveillance systems.\n\nKeywords: Object recognition, autonomous robots, deep learning, computer vision, convolutional neural networks, recurrent neural networks.\n\n'
  • 'A novel approach to image classification using convolutional neural networks (CNNs) is proposed in this paper. The proposed method, dubbed "Deep Image Classifier", leverages the power of CNNs to learn hierarchical features from images. Experimental results on several benchmark datasets, including CIFAR-10 and ImageNet, demonstrate the efficacy of the proposed method in achieving state-of-the-art performance. The code for the proposed method is made available on GitHub, allowing for easy reproduction and extension of the results. The contributions of this paper can be summarized as follows: (1) a novel CNN architecture is proposed, which consists of multiple convolutional and pooling layers, followed by fully connected layers; (2) a novel training strategy is proposed, which involves data augmentation and batch normalization; (3) the proposed method is evaluated on several benchmark datasets, and the results are compared with state-of-the-art methods. The results of this paper demonstrate the potential of CNNs in image classification tasks, and provide a new benchmark for future research in this area.'
Click here to inspect more.

Synthetic Data for fancyzhx/ag_news

Label Examples
Sports
  • 'Hamburg hampered by Lauth knock Hamburg SV striker Benjamin Lauth will be sidelined for up to four weeks because of complications to a fractured foot and perhaps longer if surgery is required, coach Klaus Toppmoeller said on Wednesday.'
  • 'Keane Pleads Not Guilty to Assault Charges (AP) AP - Manchester United captain Roy Keane pleaded not guilty to all three charges Thursday over an alleged confrontation with a 16-year-old boy.'
  • 'NBA Game Summary - San Antonio at Chicago Chicago, IL (Sports Network) - Tony Parker scored 17 points and had five assists to lead a balanced San Antonio attack that handed the Spurs a 91-75 victory over the Chicago Bulls at the United Center.'
Business
  • 'Forex: Dollar Falls After Fed Rate Hike NEW YORK (Reuters) - The dollar extended its losses on Tuesday after the Federal Reserve raised interest rates as expected but signaled that both inflation and inflation expectations were easing.'
  • 'Ameritrade Posts November Client Trades Ameritrade Holding Corp., a provider of brokerage services for individual investors, said Friday that daily average client trades in November reached 183,000, with 29,000 new accounts opened during the month.'
  • 'Firefox browser sees surge in use A sudden, measurable decline in market share in any product over the course of a few months says something, even if that product is one whose producer still holds about 90 of the market in question.'
World
  • 'Leaders Attend UAE President #39;s Funeral The United Arab Emirates appointed Sheik Khalifa bin Zayed Al Nahyan as its president Wednesday, hours after burying his father in a funeral that attracted thousands of mourners and nine heads of state to this desert nation on the Arabian Peninsula.'
  • 'Report: Tobacco Industry Hid Smoking Dangers NEW YORK (Reuters Health) - The tobacco industry for many years claimed that it was unaware of biological evidence that smoking is harmful to health, but that was untrue according to a medical journal report.'
  • 'Telenor urges fair regulatory system in Thailand (FT.com) FT.com - Telenor, the Norwegian telecommunications company, on Thursday called for "a level-playing field" in Thailand's mobile industry, urging a newly-established Thai telecoms regulator swiftly to create a fair new interconnection regime.'
Sci/Tech
  • 'Microsoft Takes Lead in Software For Handhelds Microsoft has unseated the Palm system with worldwide sales of more than 1.3 million units over the third quarter of the year, compared with slightly more than 850,000 for the Palm, according to a new report. <FONT face="verdana,MS Sans Serif,arial,helvetica" size="-2" color="#666666"><B>-The Washington Post</B></FONT>'
  • 'Telstra launches international Wi-fi roaming Telstra has launched Wi-fi roaming with five international wireless broadband operators giving Telstra customers travelling abroad access to WiFi hotspots in the UK (BT Group), USA (T-Mobile USA), Japan (NTT DoCoMo), Singapore (StarHub) and Malaysia (Maxis '
  • 'Passwords Fail To Defend Enterprises (TechWeb) TechWeb - Passwords, the dominant form of securing enterprise assets, are a failure, a research firm says.'
Click here to inspect more.

Synthetic Data for Language other than English

kenhktsui/chinese_sentiment_syn

Usage

You can specify either n_record_to_generate or all of the parameters n_source_type, n_topic, n_subtopic, and sample_per_subtopic. With the former, our implementation aims to generate that number of records, but the count may not be exact because LLM output is unpredictable. With the latter, the resulting number of records = number of labels * n_source_type * n_topic * n_subtopic * sample_per_subtopic. The latter may be needed for some problems, for example when a problem calls for more source types and fewer topics. If a classification problem has many labels, generating the synthetic data might take a while.
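The record-count formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the helper name `expected_record_count` and the example values are hypothetical and are not part of the library's API.

```python
def expected_record_count(
    n_label: int,
    n_source_type: int,
    n_topic: int,
    n_subtopic: int,
    sample_per_subtopic: int,
) -> int:
    """Number of records when every (label, source_type, topic, subtopic)
    combination is enumerated and sampled sample_per_subtopic times."""
    return n_label * n_source_type * n_topic * n_subtopic * sample_per_subtopic


# Hypothetical settings for a 4-label problem (for example, a news classifier).
total = expected_record_count(
    n_label=4,
    n_source_type=3,
    n_topic=5,
    n_subtopic=2,
    sample_per_subtopic=2,
)
print(total)  # 240
```

Because the count grows multiplicatively, doubling any one parameter doubles the number of LLM calls, which is why problems with many labels can take a while.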

Benchmarking

We tested on multiple datasets:

  • stanfordnlp/imdb
  • zeroshot/twitter-financial-news-sentiment
  • ccdv/arxiv-classification
  • lmsys/toxic-chat
  • fancyzhx/ag_news

The objective is to see whether a model trained on synthetic data performs as well as one trained on real (annotated) data. The full training dataset indicates the upper limit of performance, since more data is available. Model performance with synthetic data is on par with, or close to, that with real data. This is encouraging, because test data is usually, by design, more similar to the real training data than to synthetic data. We also note that synthetic data is advantageous when classes are highly imbalanced, as in the toxic chat problem.
Our benchmark suggests that the generated synthetic data is close to the distribution of the test data, which shows the effectiveness of this synthetic data generation approach without using any real data.
All models are fine-tuned from sentence-transformers/paraphrase-mpnet-base-v2 (109M). Performance can likely be boosted by using a larger base model and generating more data.

| dataset | metric | synthetic data generation | annotation | full training dataset | full training reference |
| --- | --- | --- | --- | --- | --- |
| stanfordnlp/imdb | accuracy | 0.878 | 0.908 | 0.928 | lvwerra/distilbert-imdb |
| zeroshot/twitter-financial-news-sentiment | f1 (weighted) | 0.631 | 0.676 | 0.866 | nickmuchi/finbert-tone-finetuned-fintwitter-classification |
| ccdv/arxiv-classification | accuracy | 0.618 | 0.566 | 0.805 | paper |
| lmsys/toxic-chat, toxicchat0124 | f1 (binary) | 0.362 | 0.00* | 0.822 | lmsys/toxicchat-t5-large-v1.0 |
| fancyzhx/ag_news | accuracy | 0.768 | 0.765 | 0.938 | fabriceyhc/bert-base-uncased-ag_news |

* Out of 42 annotations, only 2 labels are positive, making learning hard.
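As a rough illustration of how a binary F1 of 0.00 can arise, the sketch below assumes a model that never predicts the positive class. This is a plausible outcome when only 2 of 42 training labels are positive, but we have not confirmed that this is what happened in the benchmark. The toy labels and the helper `binary_f1` are hypothetical.

```python
def binary_f1(y_true: list[int], y_pred: list[int]) -> float:
    """Binary F1 for the positive class (label 1): 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Hypothetical test labels: 2 positives (toxic) among 10 examples.
y_true = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
# Assumed behaviour: the model predicts "negative" for every example.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                   # 0.8
print(binary_f1(y_true, y_pred))  # 0.0
```

Accuracy stays high in this toy case while binary F1 collapses to zero, which is why F1 is the more informative metric for an imbalanced problem such as toxic chat.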

Code to replicate the benchmarks is stored in examples. We will continue to add benchmarks on other datasets.

Reference

  1. Chan, X., Wang, X., Yu, D., Mi, H., Yu, D., 2024. Scaling synthetic data creation with 1,000,000,000 personas. URL: https://arxiv.org/abs/2406.20094, arXiv:2406.20094.